#html #web-scraping #beautifulsoup
Question:
Code:
    import urllib.request
    from bs4 import BeautifulSoup
    from requests import get
    import urllib
    import requests

    week_11_picURL = "https://www.packers.com/photos/game-photos-packers-at-vikings-week-11-2021#9258618e-e793-41ae-8d9a-d3792366dcbb"

    response = get(week_11_picURL)
    print(response)

    html_page = requests.get(week_11_picURL)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    image = soup.findAll('div', class_="nfl-c-photo-album__picture-wrapper")
Result:
    <div class="nfl-c-photo-album__picture-wrapper" data-id="146a902d-8de3-484b-ba55-1cf9d26b129c" data-name="Game Photos: Packers at Vikings | Week 11:1">
      <button aria-label="Open Lightbox View" class="nfl-c-photo-album__enlarge-button" title="Open Lightbox View"> </button>
      <picture><!--[if IE 9]><video style="display: none; "><![endif]-->
        <source media="(min-width:1024px)" srcset="https://static.clubs.nfl.com/image/private/t_new_photo_album/f_auto/packers/hjmcucejx2vmfshjkdkj.jpg 1x, https://static.clubs.nfl.com/image/private/t_new_photo_album_2x/f_auto/packers/hjmcucejx2vmfshjkdkj.jpg 2x, https://static.clubs.nfl.com/image/private/t_new_photo_album_3x/f_auto/packers/hjmcucejx2vmfshjkdkj.jpg"/>
        <source media="(min-width:768px)" srcset="https://static.clubs.nfl.com/image/private/t_new_photo_album/f_auto/packers/hjmcucejx2vmfshjkdkj.jpg 1x, https://static.clubs.nfl.com/image/private/t_new_photo_album_2x/f_auto/packers/hjmcucejx2vmfshjkdkj.jpg 2x, https://static.clubs.nfl.com/image/private/t_new_photo_album_3x/f_auto/packers/hjmcucejx2vmfshjkdkj.jpg"/>
        <source srcset="https://static.clubs.nfl.com/image/private/t_new_photo_album/f_auto/packers/hjmcucejx2vmfshjkdkj.jpg 1x, https://static.clubs.nfl.com/image/private/t_new_photo_album_2x/f_auto/packers/hjmcucejx2vmfshjkdkj.jpg 2x, https://static.clubs.nfl.com/image/private/t_new_photo_album_3x/f_auto/packers/hjmcucejx2vmfshjkdkj.jpg"/>
        <!--[if IE 9]></video><![endif]-->
        <img alt="211121-game-photos-2560" class="img-responsive" src="https://static.clubs.nfl.com/image/private/t_new_photo_album/t_lazy/f_auto/packers/hjmcucejx2vmfshjkdkj.jpg"/>
      </picture>
      <div class="nfl-c-photo-album__picture-info">
        <div class="nfl-c-photo-album__progress"> <span style=""> 1 / 129 </span> </div>
        <div class="nfl-c-photo-album__football-divider"> <span class="nfl-o-icon nfl-o-icon--medium"> <svg aria-hidden="true" class="nfl-o-icon--football" viewbox="0 0 24 24"> <use xlink:href="#football"></use> </svg> </span> </div>
        <div class="nfl-c-photo-album__copyright nfl-c-photo-album__copyright--centered"> Evan Siegle, packers.com </div>
      </div>
    </div>
    <div class="nfl-c-photo-album__picture-wrapper" data-id="27ff497e-e149-45b7-b10a-19baa179e8a1" data-name="Game Photos: Packers at Vikings | Week 11:2">
      <button aria-label="Open Lightbox View" class="nfl-c-photo-album__enlarge-button" title="Open Lightbox View"> </button>
      <picture is-lazy="/t_lazy"><!--[if IE 9]><video style="display: none; "><![endif]-->
        <source data-srcset="https://static.clubs.nfl.com/image/private/t_new_photo_album/t_lazy/f_auto/packers/rgsvjp6sxu89ditolacv.jpg 1x, https://static.clubs.nfl.com/image/private/t_new_photo_album_2x/t_lazy/f_auto/packers/rgsvjp6sxu89ditolacv.jpg 2x, https://static.clubs.nfl.com/image/private/t_new_photo_album_3x/t_lazy/f_auto/packers/rgsvjp6sxu89ditolacv.jpg" media="(min-width:1024px)"/>
        <source data-srcset="https://static.clubs.nfl.com/image/private/t_new_photo_album/t_lazy/f_auto/packers/rgsvjp6sxu89ditolacv.jpg 1x, https://static.clubs.nfl.com/image/private/t_new_photo_album_2x/t_lazy/f_auto/packers/rgsvjp6sxu89ditolacv.jpg 2x, https://static.clubs.nfl.com/image/private/t_new_photo_album_3x/t_lazy/f_auto/packers/rgsvjp6sxu89ditolacv.jpg" media="(min-width:768px)"/>
        <source data-srcset="https://static.clubs.nfl.com/image/private/t_new_photo_album/t_lazy/f_auto/packers/rgsvjp6sxu89ditolacv.jpg 1x, https://static.clubs.nfl.com/image/private/t_new_photo_album_2x/t_lazy/f_auto/packers/rgsvjp6sxu89ditolacv.jpg 2x, https://static.clubs.nfl.com/image/private/t_new_photo_album_3x/t_lazy/f_auto/packers/rgsvjp6sxu89ditolacv.jpg"/>
        <!--[if IE 9]></video><![endif]-->
        <img alt="211121-packers-vikings-1st-half-siegle-WM-001" class="img-responsive" data-src="https://static.clubs.nfl.com/image/private/t_new_photo_album/t_lazy/f_auto/packers/rgsvjp6sxu89ditolacv.jpg" src="data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw=="/>
      </picture>
      <div class="nfl-c-photo-album__picture-info">
        <div class="nfl-c-photo-album__progress">
          <span style="">
I want to be able to print only the links found when parsing this HTML. How would I do that?
To be specific, I'm trying to isolate the link that appears right after »
E.g., this link:
"https://static.clubs.nfl.com/image/private/t_new_photo_album/t_lazy/f_auto/packers/rgsvjp6sxu89ditolacv.jpg 1x"
Answer #1:
Please note that your soup contains a mix of srcset and data-srcset attributes, and that each picture holds several source tags. Also, do not use findAll() in new code; use the newer syntax find_all() instead.
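To see that mix for yourself, a small sketch along these lines lists which attribute each source tag carries. The HTML below is a trimmed stand-in with hypothetical example.com URLs, not the real page, and `attribute_report` is just an illustrative name.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the page markup (hypothetical URLs, assumption).
SAMPLE_HTML = """
<div class="nfl-c-photo-album__picture-wrapper">
  <picture>
    <source media="(min-width:1024px)" srcset="https://example.com/a.jpg 1x, https://example.com/a_2x.jpg 2x"/>
    <img src="https://example.com/a.jpg"/>
  </picture>
</div>
<div class="nfl-c-photo-album__picture-wrapper">
  <picture is-lazy="/t_lazy">
    <source data-srcset="https://example.com/b.jpg 1x, https://example.com/b_2x.jpg 2x" media="(min-width:1024px)"/>
    <source data-srcset="https://example.com/b.jpg 1x, https://example.com/b_2x.jpg 2x"/>
    <img data-src="https://example.com/b.jpg"/>
  </picture>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all() is the current spelling of the older findAll().
wrappers = soup.find_all("div", class_="nfl-c-photo-album__picture-wrapper")

attribute_report = []
for index, wrapper in enumerate(wrappers):
    for source in wrapper.find_all("source"):
        # Record whichever of the two attributes this <source> actually has.
        for attr in ("srcset", "data-srcset"):
            if source.has_attr(attr):
                attribute_report.append((index, attr))

print(attribute_report)
# [(0, 'srcset'), (1, 'data-srcset'), (1, 'data-srcset')]
```

In this sample the first picture uses srcset and the lazily loaded one uses data-srcset, which mirrors what the question's output shows.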
How to fix?
You can select your target elements more specifically with CSS selectors.
Option #1
Focus only on the sources that have data-srcset:
    data = [
        x['data-srcset'].split(',')[0]
        for x in soup.select('.nfl-c-photo-album__picture-wrapper picture source[data-srcset]:first-child')
    ]
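As a rough check of what this selector matches, here is a sketch that runs it against a tiny inline document instead of the live site. The markup and example.com URLs are made up for illustration; only the selector and the split come from the answer.

```python
from bs4 import BeautifulSoup

# Two wrappers: one eager picture (srcset) and one lazy picture (data-srcset).
# Hypothetical stand-in markup, not the real page.
SAMPLE_HTML = """
<div class="nfl-c-photo-album__picture-wrapper">
  <picture>
    <source srcset="https://example.com/a.jpg 1x, https://example.com/a_2x.jpg 2x"/>
    <img src="https://example.com/a.jpg"/>
  </picture>
</div>
<div class="nfl-c-photo-album__picture-wrapper">
  <picture is-lazy="/t_lazy">
    <source data-srcset="https://example.com/b.jpg 1x, https://example.com/b_2x.jpg 2x"/>
    <source data-srcset="https://example.com/b.jpg 1x"/>
    <img data-src="https://example.com/b.jpg"/>
  </picture>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# [data-srcset] keeps only sources with that attribute;
# :first-child keeps only the first element inside each <picture>.
selector = ".nfl-c-photo-album__picture-wrapper picture source[data-srcset]:first-child"
lazy_only = [s["data-srcset"].split(",")[0] for s in soup.select(selector)]

print(lazy_only)
# ['https://example.com/b.jpg 1x']
```

The eager picture is skipped because its first source has srcset rather than data-srcset, so with this option the first photo of the album would be missing from the result.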
Option #2
Also include the sources that have srcset:
    soup.select('.nfl-c-photo-album__picture-wrapper picture source:first-child')
Iterate over the result set with try and except to avoid errors, and append the results to a list:
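    data = []
    for x in soup.select('.nfl-c-photo-album__picture-wrapper picture source:first-child'):
        try:
            # Eagerly loaded pictures carry the URLs in srcset
            data.append(x['srcset'].split(',')[0])
        except KeyError:
            # Lazily loaded pictures carry them in data-srcset instead
            data.append(x['data-srcset'].split(',')[0])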
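If you would rather avoid try and except, one possible variation uses Tag.get(), which returns None for a missing attribute. This is only a sketch: `first_candidate` is a hypothetical helper name, and the sample markup with example.com URLs stands in for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup, not the real page.
SAMPLE_HTML = """
<div class="nfl-c-photo-album__picture-wrapper">
  <picture>
    <source srcset="https://example.com/a.jpg 1x, https://example.com/a_2x.jpg 2x"/>
    <img src="https://example.com/a.jpg"/>
  </picture>
</div>
<div class="nfl-c-photo-album__picture-wrapper">
  <picture is-lazy="/t_lazy">
    <source data-srcset="https://example.com/b.jpg 1x, https://example.com/b_2x.jpg 2x"/>
    <img data-src="https://example.com/b.jpg"/>
  </picture>
</div>
"""


def first_candidate(source):
    """Return the first srcset entry of a <source> tag, or None if it has neither attribute."""
    raw = source.get("srcset") or source.get("data-srcset")
    return raw.split(",")[0].strip() if raw else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
selector = ".nfl-c-photo-album__picture-wrapper picture source:first-child"

# Skip any <source> that carries neither attribute instead of raising.
data = [c for c in map(first_candidate, soup.select(selector)) if c]

print(data)
# ['https://example.com/a.jpg 1x', 'https://example.com/b.jpg 1x']
```

Splitting on a comma assumes the URLs themselves contain no commas, which holds for the links shown in the question.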
Example
    import requests
    from bs4 import BeautifulSoup

    week_11_picURL = "https://www.packers.com/photos/game-photos-packers-at-vikings-week-11-2021#9258618e-e793-41ae-8d9a-d3792366dcbb"

    # Fetch the page once and parse it
    html_page = requests.get(week_11_picURL)
    print(html_page)
    soup = BeautifulSoup(html_page.content, 'html.parser')

    # Take the first <source> of every picture and keep the first entry of its srcset
    data = []
    for x in soup.select('.nfl-c-photo-album__picture-wrapper picture source:first-child'):
        try:
            data.append(x['srcset'].split(',')[0])
        except KeyError:
            data.append(x['data-srcset'].split(',')[0])

    print(data)
Output
    ['https://static.clubs.nfl.com/image/private/t_new_photo_album/f_auto/packers/hjmcucejx2vmfshjkdkj.jpg 1x',
     'https://static.clubs.nfl.com/image/private/t_new_photo_album/t_lazy/f_auto/packers/rgsvjp6sxu89ditolacv.jpg 1x',
     'https://static.clubs.nfl.com/image/private/t_new_photo_album/t_lazy/f_auto/packers/zsogvqrqgaauqcdgejde.jpg 1x',
     'https://static.clubs.nfl.com/image/private/t_new_photo_album/t_lazy/f_auto/packers/jyegqthuab2hsuygirqp.jpg 1x',
     'https://static.clubs.nfl.com/image/private/t_new_photo_album/t_lazy/f_auto/packers/kwsq1fvn41f6kzqo4nkl.jpg 1x',
     'https://static.clubs.nfl.com/image/private/t_new_photo_album/t_lazy/f_auto/packers/xludbah0g8oqlyvr7d0p.jpg 1x',
     'https://static.clubs.nfl.com/image/private/t_new_photo_album/t_lazy/f_auto/packers/n6tkqlr65hv39hadt6tl.jpg 1x',
     'https://static.clubs.nfl.com/image/private/t_new_photo_album/t_lazy/f_auto/packers/mhtylxhf2ito5f3y7cb7.jpg 1x',
     'https://static.clubs.nfl.com/image/private/t_new_photo_album/t_lazy/f_auto/packers/an8onb7coak1psw7inp5.jpg 1x',
     'https://static.clubs.nfl.com/image/private/t_new_photo_album/t_lazy/f_auto/packers/ttas30klcrtagdxnl2af.jpg 1x',
     ...]
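Each entry above still ends with its " 1x" density descriptor. If you only want the bare URL, a sketch like the following could strip it. `strip_descriptor` is a hypothetical helper, and the pattern assumes descriptors of the form "1x", "1.5x" or "480w".

```python
import re

# Trailing whitespace plus a density ("2x", "1.5x") or width ("480w") descriptor.
_DESCRIPTOR = re.compile(r"\s+\d+(?:\.\d+)?[xw]$")


def strip_descriptor(candidate):
    """Drop a trailing srcset descriptor, leaving only the URL."""
    return _DESCRIPTOR.sub("", candidate.strip())


# Two entries copied from the output above.
data = [
    "https://static.clubs.nfl.com/image/private/t_new_photo_album/f_auto/packers/hjmcucejx2vmfshjkdkj.jpg 1x",
    "https://static.clubs.nfl.com/image/private/t_new_photo_album/t_lazy/f_auto/packers/rgsvjp6sxu89ditolacv.jpg 1x",
]

urls = [strip_descriptor(entry) for entry in data]
for url in urls:
    print(url)
```

Entries that carry no descriptor pass through unchanged, so the helper is safe to apply to every item in the list.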