#python #web-scraping #beautifulsoup
Question:
<p class="graytext">2012 Transcripts</p>
<blockquote><p><a title="October 3, 2012 Debate Transcript" href="/voter-education/debate-transcripts/october-3-2012-debate-transcript/">October 3, 2012: The First Obama-Romney Presidential Debate</a></p>
<p><a href="/voter-education/debate-transcripts/october-11-2012-the-biden-romney-vice-presidential-debate/">October 11, 2012: The Biden-Ryan Vice Presidential Debate</a></p>
<p><a href="/voter-education/debate-transcripts/october-16-2012-the-second-obama-romney-presidential-debate/">October 16, 2012: The Second Obama-Romney Presidential Debate</a></p>
<p><a href="/voter-education/debate-transcripts/october-22-2012-the-third-obama-romney-presidential-debate/">October 22, 2012: The Third Obama-Romney Presidential Debate</a></p></blockquote>
<hr />
<p class="graytext">2008 Transcripts</p>
<blockquote><p><a title="September 26, 2008 Debate Transcript" href="/voter-education/debate-transcripts/2008-debate-transcript/">September 26, 2008: The First McCain-Obama Presidential Debate</a></p>
<p><a title="October 2, 2008 Debate Transcript" href="/voter-education/debate-transcripts/2008-debate-transcript-2/">October 2, 2008: The Biden-Palin Vice Presidential Debate</a></p>
<p><a title="October 7, 2008 Debate Transcript" href="/voter-education/debate-transcripts/october-7-2008-debate-transcrip/">October 7, 2008: The Second McCain-Obama Presidential Debate</a></p>
<p><a title="October 15, 2008 Debate Transcript" href="/voter-education/debate-transcripts/october-15-2008-debate-transcript/">October 15, 2008: The Third McCain-Obama Presidential Debate</a></p></blockquote>
<hr />
<p class="graytext">2004 Transcripts</p>
<blockquote><p><a title="October 13, 2004 Debate Transcript" href="/voter-education/debate-transcripts/october-13-2004-debate-transcript/">October 13, 2004: The Third Bush-Kerry Presidential Debate</a></p>
<p><a title="October 8, 2004 Debate Transcript" href="/voter-education/debate-transcripts/october-8-2004-debate-transcript/">October 8, 2004: The Second Bush-Kerry Presidential Debate</a></p>
<p><a title="October 5, 2004 Transcript" href="/voter-education/debate-transcripts/october-5-2004-transcript/">October 5, 2004: The Cheney-Edwards Vice Presidential Debate</a></p>
<p><a title="September 30. 2004 Debate Transcript" href="/voter-education/debate-transcripts/september-30-2004-debate-transcript/">September 30, 2004: The First Bush-Kerry Presidential Debate</a></p></blockquote>
<hr />
<p class="graytext">2000 Transcripts</p>
<blockquote><p><a title="October 3, 2000 Transcript" href="/voter-education/debate-transcripts/october-3-2000-transcript/">October 3, 2000: The First Gore-Bush Presidential Debate</a></p>
<p><a title="October 5, 2000 Debate Transcript" href="/voter-education/debate-transcripts/october-5-2000-debate-transcript/">October 5, 2000: The Lieberman-Cheney Vice Presidential Debate</a></p>
<p><a title="October 11, 2000 Debate Transcript" href="/voter-education/debate-transcripts/october-11-2000-debate-transcript/">October 11, 2000: The Second Gore-Bush Presidential Debate</a></p>
<p><a title="October 17, 2000 Debate Transcript" href="/voter-education/debate-transcripts/october-17-2000-debate-transcript/">October 17, 2000: The Third Gore-Bush Presidential Debate</a></p>
<p><a title="Debate Transcript Translations" href="/voter-education/debate-transcripts/2000-debate-transcripts-translations/">The 2000 Debate Transcripts: Transcripts of the debates translated into six languages</a></p></blockquote>
<hr />
The task is to scrape the link to the first presidential debate in 2008 and in 2004.
So the answer is the first link in the 2008 and 2004 transcript blocks, but how do I scrape it?
Comments:
1. Which programming language do you want this done in?
2. Using Python's BeautifulSoup library.
Answer #1:
Import the Beautiful Soup dependencies.
from bs4 import BeautifulSoup

# html_doc is assumed to hold the path to the saved HTML file
with open(html_doc) as page:
    soup = BeautifulSoup(page.read(), 'html.parser')

# Each <blockquote> holds one year's transcripts; .a is its first link
for blockquote in soup.find_all('blockquote'):
    href = blockquote.a['href']
    if '2004' in href or '2008' in href:
        print(href)
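The check above relies on the year appearing in the first href of each block. As a minimal sketch of an alternative, assuming every blockquote is preceded by a sibling p with class graytext as in the question's HTML, the year can be read from that heading instead. The html string below is a trimmed copy of the question's markup, and wanted_years and first_links are names introduced only for this illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question
html = """
<p class="graytext">2008 Transcripts</p>
<blockquote><p><a href="/voter-education/debate-transcripts/2008-debate-transcript/">September 26, 2008: The First McCain-Obama Presidential Debate</a></p>
<p><a href="/voter-education/debate-transcripts/2008-debate-transcript-2/">October 2, 2008: The Biden-Palin Vice Presidential Debate</a></p></blockquote>
<hr />
<p class="graytext">2004 Transcripts</p>
<blockquote><p><a href="/voter-education/debate-transcripts/october-13-2004-debate-transcript/">October 13, 2004: The Third Bush-Kerry Presidential Debate</a></p>
<p><a href="/voter-education/debate-transcripts/september-30-2004-debate-transcript/">September 30, 2004: The First Bush-Kerry Presidential Debate</a></p></blockquote>
<hr />
<p class="graytext">2000 Transcripts</p>
<blockquote><p><a href="/voter-education/debate-transcripts/october-3-2000-transcript/">October 3, 2000: The First Gore-Bush Presidential Debate</a></p></blockquote>
"""

soup = BeautifulSoup(html, "html.parser")
wanted_years = ("2008", "2004")  # hypothetical name, for illustration only
first_links = {}

for blockquote in soup.find_all("blockquote"):
    # Nearest preceding sibling heading, e.g. "2008 Transcripts"
    heading = blockquote.find_previous_sibling("p", class_="graytext")
    if heading is None or blockquote.a is None:
        continue
    year = heading.get_text(strip=True).split()[0]
    if year in wanted_years:
        # blockquote.a is the first link inside this block
        first_links[year] = blockquote.a["href"]

print(first_links)
```

This keeps working even if a block's first href happens not to contain the year.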
Comments:
1. This doesn't work, since I want to scrape one specific block, not every block. I've edited the question; can you help me now?
2. @SejalMohata I've changed the solution to match the edited question.
Answer #2:
You can find the p tags with class graytext whose text matches 2004|2008, then use find_next('a') to get the first link after each of those p tags.
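from bs4 import BeautifulSoup
import re

# html is assumed to hold the page source as a string
soup = BeautifulSoup(html, 'html.parser')
# Year headings: <p class="graytext"> whose text mentions 2008 or 2004
# (string= is the current name for the older text= argument)
wanted_p = soup.find_all('p', class_='graytext', string=re.compile('2008|2004'))
for p in wanted_p:
    # First <a> that follows the heading in document order
    print(p.find_next('a'))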
Output:
<a href="/voter-education/debate-transcripts/2008-debate-transcript/" title="September 26, 2008 Debate Transcript">September 26, 2008: The First McCain-Obama Presidential Debate</a>
<a href="/voter-education/debate-transcripts/october-13-2004-debate-transcript/" title="October 13, 2004 Debate Transcript">October 13, 2004: The Third Bush-Kerry Presidential Debate</a>
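Note that the 2004 block lists its debates newest first, so the first link there is the third debate. If what you actually want is the debate labelled "First" in each year, a rough sketch (assuming the link text always contains the word "First" for that debate, and that each heading is followed by a sibling blockquote) could look like this. The html string is a trimmed copy of the question's markup and first_debates is a name made up for the example.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question
html = """
<p class="graytext">2008 Transcripts</p>
<blockquote><p><a href="/voter-education/debate-transcripts/2008-debate-transcript/">September 26, 2008: The First McCain-Obama Presidential Debate</a></p>
<p><a href="/voter-education/debate-transcripts/2008-debate-transcript-2/">October 2, 2008: The Biden-Palin Vice Presidential Debate</a></p></blockquote>
<hr />
<p class="graytext">2004 Transcripts</p>
<blockquote><p><a href="/voter-education/debate-transcripts/october-13-2004-debate-transcript/">October 13, 2004: The Third Bush-Kerry Presidential Debate</a></p>
<p><a href="/voter-education/debate-transcripts/september-30-2004-debate-transcript/">September 30, 2004: The First Bush-Kerry Presidential Debate</a></p></blockquote>
"""

soup = BeautifulSoup(html, "html.parser")
first_debates = {}  # hypothetical name: heading text -> href

for p in soup.find_all("p", class_="graytext", string=re.compile("2008|2004")):
    # The block of links that belongs to this year heading
    block = p.find_next_sibling("blockquote")
    if block is None:
        continue
    # Pick the link whose text mentions "First", wherever it sits in the block
    link = block.find("a", string=re.compile(r"\bFirst\b"))
    if link is not None:
        first_debates[p.get_text(strip=True)] = link["href"]

for heading, href in first_debates.items():
    print(heading, href)
```

If the first link in each block is really what you need, p.find_next('a')['href'] gives the URL rather than the whole tag.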
Answer #3:
Given that you know which years you want, you can use attribute = value selectors (here the *= substring form) with select_one to target the matching hrefs. select_one returns the first match in document order.
# [href*='...'] matches an element whose href contains the substring; .text is the link text
debate2008 = soup.select_one("[href*='2008-debate-transcript']").text
debate2004 = soup.select_one("[href*='2004-debate-transcript']").text
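select_one returns None when nothing matches, so calling .text on it directly can raise AttributeError. Here is a small sketch that guards against that and returns the URL instead of the link text. BASE_URL is a placeholder (the real site address is not given in the question) and first_href is a helper name invented for this example; the html string is a trimmed copy of the question's markup.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "https://example.org"  # placeholder, NOT the real site address

# Trimmed copy of the markup from the question
html = """
<p class="graytext">2008 Transcripts</p>
<blockquote><p><a href="/voter-education/debate-transcripts/2008-debate-transcript/">September 26, 2008: The First McCain-Obama Presidential Debate</a></p></blockquote>
<hr />
<p class="graytext">2004 Transcripts</p>
<blockquote><p><a href="/voter-education/debate-transcripts/october-13-2004-debate-transcript/">October 13, 2004: The Third Bush-Kerry Presidential Debate</a></p></blockquote>
"""

soup = BeautifulSoup(html, "html.parser")


def first_href(soup, fragment):
    """Return the absolute URL of the first <a> whose href contains fragment, or None."""
    link = soup.select_one(f'a[href*="{fragment}"]')
    if link is None:
        return None
    # The hrefs on the page are site-relative, so join them onto the base URL
    return urljoin(BASE_URL, link["href"])


debate2008 = first_href(soup, "2008-debate-transcript")
debate2004 = first_href(soup, "2004-debate-transcript")
missing = first_href(soup, "1996-debate-transcript")
print(debate2008, debate2004, missing)
```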