#python #html #html-table #python-requests
Question:
I have some HTML that contains a table. I would like to write all of its columns into a list.
The most recent data is always marked with <tr class="">. However, I don't always want the newest data, only the data for a specific period. If you look at the website, you will see that there is data for every month.
Say I want the data for August 2021. My problem is this: there are seven files per month. The first five are labeled with day, month, and year. The last two, however, are labeled N/A, yet they still belong to the same day/month.
How can I get all the information for August 2021?
    import requests
    import re
    from bs4 import BeautifulSoup
    from datetime import datetime

    DATASET_URL = "http://insideairbnb.com/get-the-data.html"
    DATASET_CITY = "Antwerp"
    DATASET_MONTHYEAR = "09.2021"

    # Converts 29 September, 2021 to 09.2021
    def datetimeConverter(d):
        if(d == 'N/A'):
            return 'N/A'
        return datetime.strptime(d, '%m.%Y').strftime('%d %B, %Y')

    #use requests
    r = requests.get(DATASET_URL)
    content = r.content

    #soup!
    soup = BeautifulSoup(content, "html.parser")
    city_table = soup.find(class_=DATASET_CITY.lower())
    print(city_table)
    <table class="table table-hover table-striped antwerp">
    <thead>
    <tr>
    <th class="col-md-3" data-field="host_id">Date Compiled</th>
    <th class="col-md-3" data-field="host_id">Country/City</th>
    <th class="col-md-3" data-field="host_id">File Name</th>
    <th class="col-md-3" data-align="right" data-field="count"> Description </th>
    </tr>
    </thead>
    <tbody>
    <tr class="">
    <td>29 September, 2021</td>
    <td>Antwerp</td>
    <td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz" onclick="var that=this;ga('send','event', 'download','listings',this.href);setTimeout(function(){location.href=that.href;},200);return false;">listings.csv.gz</a></td>
    <td>Detailed Listings data for Antwerp</td>
    </tr>
    <tr class="">
    <td>29 September, 2021</td>
    <td>Antwerp</td>
    <td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/calendar.csv.gz" onclick="var that=this;ga('send','event', 'download','calendar',this.href);setTimeout(function(){location.href=that.href;},200);return false;">calendar.csv.gz</a></td>
    <td>Detailed Calendar Data for listings in Antwerp</td>
    </tr>
    <tr class="">
    <td>29 September, 2021</td>
    <td>Antwerp</td>
    <td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/reviews.csv.gz" onclick="var that=this;ga('send','event', 'download','reviews',this.href);setTimeout(function(){location.href=that.href;},200);return false;">reviews.csv.gz</a></td>
    <td>Detailed Review Data for listings in Antwerp</td>
    </tr>
    <tr class="">
    <td>29 September, 2021</td>
    <td>Antwerp</td>
    <td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/listings.csv" onclick="var that=this;ga('send','event', 'download','listings_visualisation',this.href);setTimeout(function(){location.href=that.href;},200);return false;">listings.csv</a></td>
    <td>Summary information and metrics for listings in Antwerp (good for visualisations).</td>
    </tr>
    <tr class="">
    <td>29 September, 2021</td>
    <td>Antwerp</td>
    <td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/reviews.csv" onclick="var that=this;ga('send','event', 'download','reviews_visualisation',this.href);setTimeout(function(){location.href=that.href;},200);return false;"> reviews.csv</a></td>
    <td>Summary Review data and Listing ID (to facilitate time based analytics and visualisations linked to a listing).</td>
    </tr>
    <tr class="">
    <td>N/A</td>
    <td>Antwerp</td>
    <td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/neighbourhoods.csv" onclick="var that=this;ga('send','event', 'download','neighbourhoods',this.href);setTimeout(function(){location.href=that.href;},200);return false;">neighbourhoods.csv</a></td>
    <td>Neighbourhood list for geo filter. Sourced from city or open source GIS files.</td>
    </tr>
    <tr class="">
    <td>N/A</td>
    <td>Antwerp</td>
    <td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/neighbourhoods.geojson" onclick="var that=this;ga('send','event', 'download','geojson',this.href);setTimeout(function(){location.href=that.href;},200);return false;">neighbourhoods.geojson</a></td>
    <td>GeoJSON file of neighbourhoods of the city.</td>
    </tr>
    <tr class="archived">
    <td>27 August, 2021</td>
    <td>Antwerp</td>
    <td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-08-27/data/listings.csv.gz" onclick="var that=this;ga('send','event', 'download','listings',this.href);setTimeout(function(){location.href=that.href;},200);return false;">listings.csv.gz</a></td>
    <td>Detailed Listings data for Antwerp</td>
    </tr>
    <tr class="archived">
    <td>27 August, 2021</td>
    <td>Antwerp</td>
    <td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-08-27/data/calendar.csv.gz" onclick="var that=this;ga('send','event', 'download','calendar',this.href);setTimeout(function(){location.href=that.href;},200);return false;">calendar.csv.gz</a></td>
    <td>Detailed Calendar Data for listings in Antwerp</td>
    </tr>
    <tr class="archived">
    <td>27 August, 2021</td>
    <td>Antwerp</td>
    <td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-08-27/data/reviews.csv.gz" onclick="var that=this;ga('send','event', 'download','reviews',this.href);setTimeout(function(){location.href=that.href;},200);return false;">reviews.csv.gz</a></td>
    <td>Detailed Review Data for listings in Antwerp</td>
    </tr>
    <tr class="archived">
    <td>27 August, 2021</td>
    <td>Antwerp</td>
    <td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-08-27/visualisations/listings.csv" onclick="var that=this;ga('send','event', 'download','listings_visualisation',this.href);setTimeout(function(){location.href=that.href;},200);return false;">listings.csv</a></td>
    <td>Summary information and metrics for listings in Antwerp (good for visualisations).</td>
    </tr>
    <tr class="archived">
    <td>27 August, 2021</td>
    <td>Antwerp</td>
    <td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-08-27/visualisations/reviews.csv" onclick="var that=this;ga('send','event', 'download','reviews_visualisation',this.href);setTimeout(function(){location.href=that.href;},200);return false;"> reviews.csv</a></td>
    <td>Summary Review data and Listing ID (to facilitate time based analytics and visualisations linked to a listing).</td>
    </tr>
    <tr class="archived">
    <td>N/A</td>
    <td>Antwerp</td>
    <td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-08-27/visualisations/neighbourhoods.csv" onclick="var that=this;ga('send','event', 'download','neighbourhoods',this.href);setTimeout(function(){location.href=that.href;},200);return false;">neighbourhoods.csv</a></td>
    <td>Neighbourhood list for geo filter. Sourced from city or open source GIS files.</td>
    </tr>
    <tr class="archived">
    <td>N/A</td>
    <td>Antwerp</td>
    <td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-08-27/visualisations/neighbourhoods.geojson" onclick="var that=this;ga('send','event', 'download','geojson',this.href);setTimeout(function(){location.href=that.href;},200);return false;">neighbourhoods.geojson</a></td>
    <td>GeoJSON file of neighbourhoods of the city.</td>
    </tr>
What I Want
In the end, the list should look like this:

    list = [
        ["http://data.insideairbnb.com/belgium/vlg/antwerp/2021-08-27/data/listings.csv.gz", "listings.csv.gz", "Description", "27 August, 2021"],
        [...],
        ["http://data.insideairbnb.com/belgium/vlg/antwerp/2021-08-27/visualisations/neighbourhoods.geojson", "neighbourhoods.geojson", "Description", "27 August, 2021"],
    ]

The HTML I want:
    <td>27 August, 2021</td>
    <td>Antwerp</td>
    <td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-08-27/data/listings.csv.gz" onclick="var that=this;ga('send','event', 'download','listings',this.href);setTimeout(function(){location.href=that.href;},200);return false;">listings.csv.gz</a></td>
    <td>Detailed Listings data for Antwerp</td>
    </tr>
    <tr class="archived">
    <td>27 August, 2021</td>
    <td>Antwerp</td>
    <td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-08-27/data/calendar.csv.gz" onclick="var that=this;ga('send','event', 'download','calendar',this.href);setTimeout(function(){location.href=that.href;},200);return false;">calendar.csv.gz</a></td>
    <td>Detailed Calendar Data for listings in Antwerp</td>
    </tr>
    <tr class="archived">
    <td>27 August, 2021</td>
    <td>Antwerp</td>
    <td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-08-27/data/reviews.csv.gz" onclick="var that=this;ga('send','event', 'download','reviews',this.href);setTimeout(function(){location.href=that.href;},200);return false;">reviews.csv.gz</a></td>
    <td>Detailed Review Data for listings in Antwerp</td>
    </tr>
    <tr class="archived">
    <td>27 August, 2021</td>
    <td>Antwerp</td>
    <td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-08-27/visualisations/listings.csv" onclick="var that=this;ga('send','event', 'download','listings_visualisation',this.href);setTimeout(function(){location.href=that.href;},200);return false;">listings.csv</a></td>
    <td>Summary information and metrics for listings in Antwerp (good for visualisations).</td>
    </tr>
    <tr class="archived">
    <td>27 August, 2021</td>
    <td>Antwerp</td>
    <td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-08-27/visualisations/reviews.csv" onclick="var that=this;ga('send','event', 'download','reviews_visualisation',this.href);setTimeout(function(){location.href=that.href;},200);return false;"> reviews.csv</a></td>
    <td>Summary Review data and Listing ID (to facilitate time based analytics and visualisations linked to a listing).</td>
    </tr>
    <tr class="archived">
    <td>N/A</td>
    <td>Antwerp</td>
    <td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-08-27/visualisations/neighbourhoods.csv" onclick="var that=this;ga('send','event', 'download','neighbourhoods',this.href);setTimeout(function(){location.href=that.href;},200);return false;">neighbourhoods.csv</a></td>
    <td>Neighbourhood list for geo filter. Sourced from city or open source GIS files.</td>
    </tr>
    <tr class="archived">
    <td>N/A</td>
    <td>Antwerp</td>
    <td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-08-27/visualisations/neighbourhoods.geojson" onclick="var that=this;ga('send','event', 'download','geojson',this.href);setTimeout(function(){location.href=that.href;},200);return false;">neighbourhoods.geojson</a></td>
    <td>GeoJSON file of neighbourhoods of the city.</td>
    </tr>
Answer 1:
You need to filter the result further to get what you want.
The code below should give you the desired result:
    import requests
    from bs4 import BeautifulSoup
    from datetime import datetime

    DATASET_URL = "http://insideairbnb.com/get-the-data.html"
    DATASET_CITY = "Antwerp"
    DATASET_MONTHYEAR = "09.2021"


    # Converts 09.2021 to 2021-09
    def datetimeConverter(d):
        return datetime.strptime(d, '%m.%Y').strftime('%Y-%m')


    # Keep a row if its download URL contains the requested year-month
    def filter_records_by_date(record):
        return datetimeConverter(DATASET_MONTHYEAR) in record.find_all('td')[2].a.get('href')


    # use requests
    r = requests.get(DATASET_URL)
    content = r.content

    # soup!
    soup = BeautifulSoup(content, "html.parser")
    # Skip the header row of the city's table
    city_table = soup.find(class_=DATASET_CITY.lower()).find_all('tr')[1:]
    filtered_city_table = list(filter(filter_records_by_date, city_table))

    # The first matching row carries the date; the "N/A" rows reuse it
    query_date = filtered_city_table[0].td.text

    data = []
    for row in filtered_city_table:
        cells = row.find_all('td')
        val = [cells[2].find('a').get("href"), cells[3].text, query_date]
        data.append(val)

    print(data)
    print(len(data))
Output:

    [
        ['http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz', 'Detailed Listings data for Antwerp', '29 September, 2021'],
        ['http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/calendar.csv.gz', 'Detailed Calendar Data for listings in Antwerp', '29 September, 2021'],
        ['http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/reviews.csv.gz', 'Detailed Review Data for listings in Antwerp', '29 September, 2021'],
        ['http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/listings.csv', 'Summary information and metrics for listings in Antwerp (good for visualisations).', '29 September, 2021'],
        ['http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/reviews.csv', 'Summary Review data and Listing ID (to facilitate time based analytics and visualisations linked to a listing).', '29 September, 2021'],
        ['http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/neighbourhoods.csv', 'Neighbourhood list for geo filter. Sourced from city or open source GIS files.', '29 September, 2021'],
        ['http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/neighbourhoods.geojson', 'GeoJSON file of neighbourhoods of the city.', '29 September, 2021']
    ]
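The question also asked for the file name in each entry. A minimal sketch of how that could look, run against a trimmed copy of the table instead of the live page so it can be tried offline. `SAMPLE_HTML` and `rows_for_month` are illustrative names that do not appear in the answer above, the `onclick` attributes are left out for brevity, and the sketch assumes every download URL carries the date as a `/YYYY-MM-DD/` path segment, as in the rows shown in the question:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question (assumption: same layout as the live page).
SAMPLE_HTML = """
<table class="table table-hover table-striped antwerp">
<tbody>
<tr class=""><td>29 September, 2021</td><td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz">listings.csv.gz</a></td>
<td>Detailed Listings data for Antwerp</td></tr>
<tr class="archived"><td>27 August, 2021</td><td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-08-27/data/listings.csv.gz">listings.csv.gz</a></td>
<td>Detailed Listings data for Antwerp</td></tr>
<tr class="archived"><td>N/A</td><td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-08-27/visualisations/neighbourhoods.geojson">neighbourhoods.geojson</a></td>
<td>GeoJSON file of neighbourhoods of the city.</td></tr>
</tbody>
</table>
"""


def rows_for_month(html, city, year_month):
    """Return [href, file name, description, date] for rows whose URL contains /YYYY-MM-."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_=city.lower())
    if table is None:
        return []

    matches = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 4 or cells[2].a is None:
            continue  # header row, or a row without a download link
        href = cells[2].a.get("href", "")
        if f"/{year_month}-" in href:
            matches.append((cells, href))

    # Use the first date that is not "N/A" for every row of that month.
    dates = (cells[0].get_text(strip=True) for cells, _ in matches)
    date_text = next((d for d in dates if d != "N/A"), "N/A")

    return [
        [href, cells[2].a.get_text(strip=True), cells[3].get_text(strip=True), date_text]
        for cells, href in matches
    ]


result = rows_for_month(SAMPLE_HTML, "Antwerp", "2021-08")
for entry in result:
    print(entry)
```

Against the live page, the same function could be called with `requests.get(DATASET_URL).content` in place of `SAMPLE_HTML`, provided the page still has the structure shown in the question.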
Comments:
1. Thank you very much. I would like to specify only 08.2021. How could I then tell it to consider 27 August 2021 and not September?
2. I want to set the date and read the right data based on that date. The only thing missing is the ability to set the date.
3. Would you specify "month, year" (e.g. 08.2021) or a specific date, "27 August, 2021"?
4. I mean that I enter only the month and year, e.g. 08.2021 (August 2021), and it should then find all values that were created in August, which would be 27 August, 2021. So only the month and year should be considered; the day does not matter. But I would like to read different months.
5. Like DATASET_MONTHYEAR = "08.2021"
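To make the month selection discussed in these comments concrete, here is a small standard-library sketch of matching by month and year only. The helper names (`month_prefix`, `compiled_date_from_url`, `url_in_month`) are hypothetical, and it assumes the download URLs keep the `/YYYY-MM-DD/` segment seen in the question, which also lets the "N/A" rows be assigned to a date:

```python
import re
from datetime import datetime

# Assumption: each download URL contains a /YYYY-MM-DD/ path segment.
DATE_SEGMENT = re.compile(r"/(\d{4}-\d{2}-\d{2})/")


def month_prefix(month_year):
    """Convert '08.2021' to '2021-08'."""
    return datetime.strptime(month_year, "%m.%Y").strftime("%Y-%m")


def compiled_date_from_url(url):
    """Return the date embedded in the URL as a datetime.date, or None."""
    match = DATE_SEGMENT.search(url)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d").date()


def url_in_month(url, month_year):
    """True if the date embedded in the URL falls in the given month; the day is ignored."""
    found = compiled_date_from_url(url)
    return found is not None and found.strftime("%Y-%m") == month_prefix(month_year)


urls = [
    "http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz",
    "http://data.insideairbnb.com/belgium/vlg/antwerp/2021-08-27/data/listings.csv.gz",
    "http://data.insideairbnb.com/belgium/vlg/antwerp/2021-08-27/visualisations/neighbourhoods.geojson",
]

# Only the two August files remain, including the one whose table row says "N/A".
august = [u for u in urls if url_in_month(u, "08.2021")]
print(august)
```

Changing the second argument to another value such as "09.2021" selects a different month without touching the rest of the code.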