BeautifulSoup not producing a proper CSV file of scraped data

#python #pandas #csv #web-scraping #beautifulsoup

Question:

I'm fairly new to web scraping, so apologies if the answer to my problem is obvious. I made a web scraper that goes through the reviews of a Steam game (Civilization 6) and collects information such as hours spent in the game, whether the reviewer recommended it or not, the products they own, and so on.

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup as bs

    url = "https://steamcommunity.com/app/289070/reviews/?browsefilter=toprated&snr=1_5_100010_"

    review_dict = {
        "found_helpful": [],
        "title": [],  # recommended or not
        "hours": [],
        "prods_in_account": [],
        "words_in_review": []
    }

    def data_scrapper():
        """
        Gets the reviews from the Steam page.
        """
        response = requests.get(url)
        soup = bs(response.content, "html.parser")
        card_div = soup.findAll("div", attrs={"class": "apphub_Card modalContentLink interactable"})

        for cards in card_div:
            found_helpful = cards.find("div", attrs={"class": "found_helpful"})
            vote_header = cards.find("div", attrs={"class": "vote_header"})
            hours = cards.find("div", attrs={"class": "hours"})
            products = cards.find("div", attrs={"class": "apphub_CardContentMoreLink ellipsis"})
            words_in_review = cards.find("div", attrs={"class": "apphub_CardTextContent"})

            review_dict["found_helpful"].append(found_helpful)
            review_dict["title"].append(vote_header)
            review_dict["hours"].append(hours)
            review_dict["prods_in_account"].append(products)
            review_dict["words_in_review"].append(len(words_in_review))

    data_scrapper()

    review_df = pd.DataFrame.from_dict(review_dict)
    review_df.to_csv("review.csv", sep=",")

My problem is that when I run my code I expect an organized CSV file, but instead I get this:

    ,found_helpful,title,hours,prods_in_account,words_in_review
    0,"<div class=""found_helpful"">
    3,398 people found this review helpful<br/>159 people found this review funny
    <div class=""review_award_aggregated tooltip"" data-tooltip-class=""review_reward_tooltip"" data-tooltip-html='&lt;div class=""review_award_ctn_hover""&gt;
    &lt;div class=""review_award"" data-reaction=""6"" data-reactioncount=""5""&gt;
    &lt;img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/6.png?v=5""/&gt;
    &lt;span class=""review_award_count ""&gt;5&lt;/span&gt;
    &lt;/div&gt;
    &lt;div class=""review_award"" data-reaction=""3"" data-reactioncount=""3""&gt;
    &lt;img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/3.png?v=5""/&gt;
    &lt;span class=""review_award_count ""&gt;3&lt;/span&gt;
    &lt;/div&gt;
    &lt;div class=""review_award"" data-reaction=""5"" data-reactioncount=""2""&gt;
    &lt;img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/5.png?v=5""/&gt;
    &lt;span class=""review_award_count ""&gt;2&lt;/span&gt;
    &lt;/div&gt;
    &lt;div class=""review_award"" data-reaction=""1"" data-reactioncount=""1""&gt;
    &lt;img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/1.png?v=5""/&gt;
    &lt;span class=""review_award_count hidden""&gt;1&lt;/span&gt;
    &lt;/div&gt;
    &lt;div class=""review_award"" data-reaction=""9"" data-reactioncount=""1""&gt;
    &lt;img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/9.png?v=5""/&gt;
    &lt;span class=""review_award_count hidden""&gt;1&lt;/span&gt;
    &lt;/div&gt;
    &lt;div class=""review_award"" data-reaction=""18"" data-reactioncount=""1""&gt;
    &lt;img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/18.png?v=5""/&gt;
    &lt;span class=""review_award_count hidden""&gt;1&lt;/span&gt;
    &lt;/div&gt;
    &lt;div class=""review_award"" data-reaction=""19"" data-reactioncount=""1""&gt;
    &lt;img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/19.png?v=5""/&gt;
    &lt;span class=""review_award_count hidden""&gt;1&lt;/span&gt;
    &lt;/div&gt;
    &lt;/div&gt;'><img class=""reward_btn_icon"" src=""https://community.akamai.steamstatic.com/public/shared/images//award_icon_blue.svg""/>14</div>
    </div>","<div class=""vote_header"">
    <div class=""reviewInfo"">
    <div class=""thumb"">
    <img height=""44"" src=""https://community.akamai.steamstatic.com/public/shared/images/userreviews/icon_thumbsDown.png?v=1"" width=""44""/>
    </div>
    <div class=""title"">Not Recommended</div>
    <div class=""hours"">8,028.3 hrs on record</div>
    </div>
    <div style=""clear: left""></div>
    </div>","<div class=""hours"">8,028.3 hrs on record</div>","<div class=""apphub_CardContentMoreLink ellipsis"">167 products in account</div>",38

I've gone over my function for extracting and appending the data, but I still get this strange file. Any hints on what I'm doing wrong?

Comments:

1. As you can see, found_helpful contains the entire <div> tag. You want to extract the text inside that tag, which is available via found_helpful.text (see the sketch below).
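
To see what the comment means, here is a minimal standalone sketch (the HTML snippet is made up for illustration and is not the real Steam markup) contrasting the Tag object that find() returns with the plain string you get from .text / .get_text():

    from bs4 import BeautifulSoup

    # Hypothetical snippet standing in for one review card's markup.
    html = '<div class="found_helpful">3,398 people found this review helpful</div>'
    soup = BeautifulSoup(html, "html.parser")

    tag = soup.find("div", attrs={"class": "found_helpful"})
    print(type(tag))       # <class 'bs4.element.Tag'> -- appending this stores the whole tag
    print(str(tag))        # the full <div ...>...</div> markup, which is what ends up in the CSV
    print(tag.get_text())  # 3,398 people found this review helpful  <- the text you actually want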

Answer #1:

Make these changes to the existing code:

    for cards in card_div:
        found_helpful = cards.find("div", attrs={"class": "found_helpful"}).get_text()
        vote_header = cards.find("div", attrs={"class": "vote_header"}).get_text()
        hours = cards.find("div", attrs={"class": "hours"}).get_text()
        products = cards.find("div", attrs={"class": "apphub_CardContentMoreLink ellipsis"}).get_text()
        words_in_review = cards.find("div", attrs={"class": "apphub_CardTextContent"}).get_text()

        review_dict["found_helpful"].append(found_helpful)
        review_dict["title"].append(vote_header)
        review_dict["hours"].append(hours)
        review_dict["prods_in_account"].append(products)
        review_dict["words_in_review"].append(len(words_in_review))

    review_df = pd.DataFrame.from_dict(review_dict)
    cols = review_df.select_dtypes(['object']).columns
    review_df[cols] = review_df[cols].apply(lambda x: x.str.strip())

Output:

                                            found_helpful                                   title                  hours         prods_in_account  words_in_review
    0  1,266 people found this review helpful20 peopl...        Recommended\n456.9 hrs on record    456.9 hrs on record  536 products in account              770
    1  1,127 people found this review helpful14 peopl...         Recommended\n92.1 hrs on record     92.1 hrs on record  135 products in account              574
    2  853 people found this review helpful49 people ...      Recommended\n1,360.8 hrs on record  1,360.8 hrs on record   18 products in account              181
    3  1,832 people found this review helpful18 peopl...        Recommended\n520.5 hrs on record    520.5 hrs on record  281 products in account             7114
    4  3,370 people found this review helpful40 peopl...    Not Recommended\n415.7 hrs on record    415.7 hrs on record  102 products in account              853
    5  5,724 people found this review helpful172 peop...    Not Recommended\n256.7 hrs on record    256.7 hrs on record  180 products in account             2072
    6  393 people found this review helpful10 people ...          Recommended\n22.8 hrs on record     22.8 hrs on record   85 products in account              278
    7  3,229 people found this review helpful62 peopl...     Not Recommended\n58.6 hrs on record     58.6 hrs on record  264 products in account              894
    8  1,373 people found this review helpful22 peopl...    Not Recommended\n195.3 hrs on record    195.3 hrs on record   75 products in account              556
    9  3,398 people found this review helpful159 peop...  Not Recommended\n8,028.8 hrs on record  8,028.8 hrs on record  167 products in account             8007
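
If numeric columns are wanted rather than the raw phrases, a possible follow-up step (a sketch only, assuming the column names and text formats shown in the output above) is to pull the numbers out with pandas string methods:

    # Sketch: assumes review_df already holds the stripped text columns shown above.
    # str.extract pulls the leading number out of each phrase; commas are removed
    # before converting to a numeric dtype.
    review_df["helpful_count"] = (
        review_df["found_helpful"]
        .str.extract(r"([\d,]+) people found this review helpful")[0]
        .str.replace(",", "", regex=False)
        .astype("Int64")
    )
    review_df["hours_played"] = (
        review_df["hours"]
        .str.extract(r"([\d,.]+) hrs on record")[0]
        .str.replace(",", "", regex=False)
        .astype(float)
    )
    review_df["products"] = (
        review_df["prods_in_account"]
        .str.extract(r"(\d+) products in account")[0]
        .astype("Int64")
    )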

Comments:

1. Thanks for the help. This solved half of my problem (it formatted my CSV file correctly and gave me some of the data); I still need to find the right HTML to extract the "title" and "found helpful" data (one possible approach is sketched below).
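
For those two remaining fields, the HTML dumped in the question suggests (an assumption based on that snippet, not verified against the live page) that the recommendation text sits in a nested div with class "title" inside vote_header, and that the helpful count is the first number in the found_helpful text. A possible sketch for the loop body:

    import re

    for cards in card_div:
        # Recommendation: the nested <div class="title"> holds "Recommended" / "Not Recommended"
        # (assumption based on the HTML shown in the question).
        title_div = cards.find("div", attrs={"class": "title"})
        title = title_div.get_text(strip=True) if title_div else None

        # Helpful count: take the first number in the found_helpful text, e.g. "3,398".
        helpful_div = cards.find("div", attrs={"class": "found_helpful"})
        helpful_text = helpful_div.get_text(" ", strip=True) if helpful_div else ""
        match = re.search(r"([\d,]+) people found this review helpful", helpful_text)
        helpful_count = int(match.group(1).replace(",", "")) if match else None

        review_dict["title"].append(title)
        review_dict["found_helpful"].append(helpful_count)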