Getting links from this element | python3 BeautifulSoup4

#python #html #web-scraping #beautifulsoup

Question:

First of all, I searched Google, and none of the results I found worked. I'm trying to get all the links from a news web page, so I've listed the element below; my only problem is getting the links.

 <section class="featured-category"><article class="post-box">
<div class="post-thumbnail video-play">
<figure class="image-wrapper"><a href="https://news.abs-cbn.com/ancx/culture/music/10/30/20/the-smokey-mountainthirty-years-after">
<img data-src="https://sa.kapamilya.com/absnews/abscbnnews/media/ancx/culture/2020/84/1sm_medium_thumbnail.jpg" width="188" height="125" alt="The Smokey Mountain—thirty years after" class="mp4-animations lazy img-responsive loaded" src="https://sa.kapamilya.com/absnews/abscbnnews/media/ancx/culture/2020/84/1sm_medium_thumbnail.jpg" data-was-processed="true">
</a></figure>
<div class="item-category bottom-left">
<div class="label-text">ANCX</div>
</div>
</div>
<div class="post-content">
<h2 class="post-title"><a href="https://news.abs-cbn.com/ancx/culture/music/10/30/20/the-smokey-mountainthirty-years-after">The Smokey Mountain—thirty years after</a></h2>
</div>
</article>
<article class="post-box">
<div class="post-thumbnail video-play">
<figure class="image-wrapper">
<a href="/news/11/01/20/typhoon-rolly-batters-southern-luzon">
<img data-src="https://sa.kapamilya.com/absnews/abscbnnews/media/2020/news/11/01/20201101-south-luzon-rolly-lucenapolice_medium_thumbnail.jpg" width="188" height="125" alt="Typhoon Rolly batters Southern Luzon" class="mp4-animations lazy img-responsive loaded" src="https://sa.kapamilya.com/absnews/abscbnnews/media/2020/news/11/01/20201101-south-luzon-rolly-lucenapolice_medium_thumbnail.jpg" data-was-processed="true">
</a>
</figure>
<div class="item-category bottom-left">
<div class="label-text">News</div>
</div>
</div>
<div class="post-content">
<h2 class="post-title"><a href="news/11/01/20/typhoon-rolly-batters-southern-luzon">Typhoon Rolly batters Southern Luzon</a></h2>
</div>
</article>
<article class="post-box">
<div class="post-thumbnail video-play">
<figure class="image-wrapper">
<a href="/business/11/01/20/typhoon-rolly-knocks-out-power-in-bicol-parts-of-calabarzon">
<img data-src="https://sa.kapamilya.com/absnews/abscbnnews/media/2020/news/11/01/20201101-typhoon-rolly-cagsawa-amiraflor_medium_thumbnail.jpg" width="188" height="125" alt="Typhoon Rolly knocks out power in Bicol, parts of Calabarzon" class="mp4-animations lazy img-responsive loaded" src="https://sa.kapamilya.com/absnews/abscbnnews/media/2020/news/11/01/20201101-typhoon-rolly-cagsawa-amiraflor_medium_thumbnail.jpg" data-was-processed="true">
</a>
</figure>
<div class="item-category bottom-left">
<div class="label-text">Business</div>
</div>
</div>
<div class="post-content">
<h2 class="post-title"><a href="business/11/01/20/typhoon-rolly-knocks-out-power-in-bicol-parts-of-calabarzon">Typhoon Rolly knocks out power in Bicol, parts of Calabarzon</a></h2>
</div>
</article>
<article class="post-box">
<div class="post-thumbnail video-play">
<figure class="image-wrapper">
<a href="/news/11/01/20/ph-virus-tally-now-at-383113-as-2396-new-cases-confirmed">
<img data-src="https://sa.kapamilya.com/absnews/abscbnnews/media/2020/news/07/11/coronavirus-covid-generic_medium_thumbnail.jpg" width="188" height="125" alt="PH virus tally now at 383,113 as 2,396 new cases confirmed" class="mp4-animations lazy img-responsive loaded" src="https://sa.kapamilya.com/absnews/abscbnnews/media/2020/news/07/11/coronavirus-covid-generic_medium_thumbnail.jpg" data-was-processed="true">
</a>
</figure>
<div class="item-category bottom-left">
<div class="label-text">News</div>
</div>
</div>
<div class="post-content">
<h2 class="post-title"><a href="news/11/01/20/ph-virus-tally-now-at-383113-as-2396-new-cases-confirmed">PH virus tally now at 383,113 as 2,396 new cases confirmed</a></h2>
</div>
</article>
<article class="post-box">
<div class="post-thumbnail video-play">
<figure class="image-wrapper">
<a href="/sports/11/01/20/ahead-of-resumption-of-games-pba-players-test-negative-for-covid-19">
<img data-src="https://sa.kapamilya.com/absnews/abscbnnews/media/2020/news/10/11/pba_medium_thumbnail.jpg" width="188" height="125" alt="Ahead of resumption of games, PBA players test negative for COVID-19" class="mp4-animations lazy img-responsive loaded" src="https://sa.kapamilya.com/absnews/abscbnnews/media/2020/news/10/11/pba_medium_thumbnail.jpg" data-was-processed="true">
</a>
</figure>
<div class="item-category bottom-left">
<div class="label-text">Sports</div>
</div>
</div>
<div class="post-content">
<h2 class="post-title"><a href="sports/11/01/20/ahead-of-resumption-of-games-pba-players-test-negative-for-covid-19">Ahead of resumption of games, PBA players test negative for COVID-19</a></h2>
</div>
</article>
<article class="post-box">
<div class="post-thumbnail video-play">
<figure class="image-wrapper">
<a href="/sports/11/01/20/sportsman-turned-spy-why-sean-connery-chose-james-bond-over-manchester-united">
<img data-src="https://sa.kapamilya.com/absnews/abscbnnews/media/2020/afp/11/01/20201101-seanconnery-ronaldinho-afp_medium_thumbnail.jpg" width="188" height="125" alt="Sportsman turned ‘spy’: Why Sean Connery chose James Bond over Manchester United" class="mp4-animations lazy img-responsive loaded" src="https://sa.kapamilya.com/absnews/abscbnnews/media/2020/afp/11/01/20201101-seanconnery-ronaldinho-afp_medium_thumbnail.jpg" data-was-processed="true">
</a>
</figure>
<div class="item-category bottom-left">
<div class="label-text">Sports</div>
</div>
</div>
<div class="post-content">
<h2 class="post-title"><a href="sports/11/01/20/sportsman-turned-spy-why-sean-connery-chose-james-bond-over-manchester-united">Sportsman turned ‘spy’: Why Sean Connery chose James Bond over Manchester United</a></h2>
</div>
</article>
</section>
  

EXAMPLE OF WHAT I TRIED

 
content = soup.find('div', {'class' : "post-content"})

article = ''
for letter in content.findAll("a"):
    print(letter.text)
  

Please help. I honestly don't know how to get the links, i.e. the "href" values, since I only started using BeautifulSoup today.

Answer #1:

To print all the links on the page, you can try this:

for letter in soup.find_all("a"):
    print(letter['href'])
  

Output:

 https://news.abs-cbn.com/ancx/culture/music/10/30/20/the-smokey-mountainthirty-years-after
https://news.abs-cbn.com/ancx/culture/music/10/30/20/the-smokey-mountainthirty-years-after
/news/11/01/20/typhoon-rolly-batters-southern-luzon
news/11/01/20/typhoon-rolly-batters-southern-luzon
/business/11/01/20/typhoon-rolly-knocks-out-power-in-bicol-parts-of-calabarzon
business/11/01/20/typhoon-rolly-knocks-out-power-in-bicol-parts-of-calabarzon
/news/11/01/20/ph-virus-tally-now-at-383113-as-2396-new-cases-confirmed
news/11/01/20/ph-virus-tally-now-at-383113-as-2396-new-cases-confirmed
/sports/11/01/20/ahead-of-resumption-of-games-pba-players-test-negative-for-covid-19
sports/11/01/20/ahead-of-resumption-of-games-pba-players-test-negative-for-covid-19
/sports/11/01/20/sportsman-turned-spy-why-sean-connery-chose-james-bond-over-manchester-united
sports/11/01/20/sportsman-turned-spy-why-sean-connery-chose-james-bond-over-manchester-united
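
The output above lists every article twice (once for the thumbnail link, once for the title link) and mixes absolute and relative URLs. A minimal sketch of one way to normalize and de-duplicate them, assuming the page was served from `https://news.abs-cbn.com/` (the question does not state the page URL, so `BASE_URL` is an assumption, and the HTML is a shortened stand-in for the markup in the question):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed page URL; the question does not say which page was scraped.
BASE_URL = "https://news.abs-cbn.com/"

# Shortened stand-in for one <article> from the question's markup.
html = """
<section class="featured-category">
<article class="post-box">
<figure class="image-wrapper"><a href="/news/11/01/20/typhoon-rolly-batters-southern-luzon"></a></figure>
<h2 class="post-title"><a href="news/11/01/20/typhoon-rolly-batters-southern-luzon">Typhoon Rolly batters Southern Luzon</a></h2>
</article>
</section>
"""

soup = BeautifulSoup(html, "html.parser")

unique_links = []
for a in soup.find_all("a", href=True):
    # Resolve relative hrefs against the assumed base URL.
    absolute = urljoin(BASE_URL, a["href"])
    # Keep the first occurrence only, preserving page order.
    if absolute not in unique_links:
        unique_links.append(absolute)

for link in unique_links:
    print(link)
```

Passing `href=True` to `find_all` skips any `<a>` tag that has no `href` attribute, so `a["href"]` cannot raise `KeyError` here.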
  

The page also contains links to several images. If you want to print those as well, you can add this to your code:

for img in soup.find_all('img'):
    print(img['src'])
  

Output:

 https://sa.kapamilya.com/absnews/abscbnnews/media/ancx/culture/2020/84/1sm_medium_thumbnail.jpg
https://sa.kapamilya.com/absnews/abscbnnews/media/2020/news/11/01/20201101-south-luzon-rolly-lucenapolice_medium_thumbnail.jpg
https://sa.kapamilya.com/absnews/abscbnnews/media/2020/news/11/01/20201101-typhoon-rolly-cagsawa-amiraflor_medium_thumbnail.jpg
https://sa.kapamilya.com/absnews/abscbnnews/media/2020/news/07/11/coronavirus-covid-generic_medium_thumbnail.jpg
https://sa.kapamilya.com/absnews/abscbnnews/media/2020/news/10/11/pba_medium_thumbnail.jpg
https://sa.kapamilya.com/absnews/abscbnnews/media/2020/afp/11/01/20201101-seanconnery-ronaldinho-afp_medium_thumbnail.jpg
  

Comments:

1. Wow! Could you upvote my answer and accept it as the best answer? Thanks!

Answer #2:

You can use .get('href'):

# find() returns only the first <div class="post-content"> on the page
content = soup.find('div', {'class': "post-content"})

for letter in content.find_all("a"):
    print(letter.get('href'))
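
For context, `.get('href')` and `['href']` differ only when the attribute is missing: `.get` returns `None`, while indexing raises `KeyError`. A small sketch using made-up markup (the two `<a>` tags below are illustrative and not taken from the page in the question):

```python
from bs4 import BeautifulSoup

# Illustrative markup: one anchor with an href, one without.
soup = BeautifulSoup('<a href="/news">News</a><a name="top">Top</a>', "html.parser")
with_href, without_href = soup.find_all("a")

print(with_href.get("href"))     # the attribute value
print(without_href.get("href"))  # None, because the attribute is absent

try:
    without_href["href"]
except KeyError:
    # Indexing a missing attribute raises KeyError instead of returning None.
    print("no href attribute")
```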
  

Comments:

1. It printed only one of the links, but otherwise it works well. Any idea how to print them all?
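
On the comment above: `soup.find()` returns only the first matching `div`, which is probably why a single link was printed. One possible approach, sketched here against a shortened copy of the question's markup, is to loop over every `div.post-content` with `find_all`, or to use an equivalent CSS selector:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for two <article> blocks from the question.
html = """
<article class="post-box">
<div class="post-content">
<h2 class="post-title"><a href="news/11/01/20/typhoon-rolly-batters-southern-luzon">Typhoon Rolly batters Southern Luzon</a></h2>
</div>
</article>
<article class="post-box">
<div class="post-content">
<h2 class="post-title"><a href="business/11/01/20/typhoon-rolly-knocks-out-power-in-bicol-parts-of-calabarzon">Typhoon Rolly knocks out power in Bicol, parts of Calabarzon</a></h2>
</div>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

hrefs = []
# find_all() returns every matching div, not just the first one.
for content in soup.find_all("div", class_="post-content"):
    for a in content.find_all("a", href=True):
        hrefs.append(a["href"])
        print(a["href"])

# The same selection written as a CSS selector.
selected = [a["href"] for a in soup.select("div.post-content a[href]")]
```

Both variants pick up only the title links inside `post-content`, so each article appears once rather than twice.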