#python #web-scraping #beautifulsoup
Question:
I have the following div with id="participant":
<div id="participant" class="panel-collapse collapse in" role="tabpanel" aria-expanded="true" aria-labelledby="headingOne" style="">
<div class="panel-body">
<div class="row">
<div class="col-sm-12">
<div class="question-container">
<div class="question-group">
<h5 class="question">
Organisation
</h5>
<div class="answer">
<p>Ministerio de Hacienda [Ministry of Finance]</p>
<p>Consejo de Contadores Públicos del Paraguay (Consejo) [Council of Public Accountants of Paraguay]</p>
<p>Central Bank of Paraguay – Superintendence of Banks</p>
<br>
</div>
</div>
<div class="question-group">
<h5 class="question">
Role of the organisation
</h5>
<div class="answer">
<p>The Ministry of Finance has authority to establish accounting standards for all entities in Paraguay other than banks and financial institutions.&nbsp; </p>
<p>The Consejo is the professional association of public accountants in Paraguay.&nbsp; The Consejo advises the Ministry of Finance with regard to accounting standards.</p>
<p>Accounting standards for banks and other financial institutions are established by the Central Bank of Paraguay.</p>
</div>
</div>
<div class="question-group">
<h5 class="question">
Website
</h5>
<div class="answer">
<p>Ministry of Finance: <a href="http://www.hacienda.gov.py" target="_blank">http://www.hacienda.gov.py</a></p>
<p>Consejo: <a href="http://www.consejo.com.py" target="_blank">www.consejo.com.py</a></p>
<p>Central Bank: <a href="http://www/bcp.gov.py" target="_blank">http://www/bcp.gov.py</a></p>
</div>
</div>
<div class="question-group">
<h5 class="question">
Email contact
</h5>
<div class="answer">
<p>Consejo: <a href="mailto:consejo@consejo.com.py">consejo@consejo.com.py</a><br>
Central Bank:
</p>
<ul>
<li><a href="mailto:afranco@bcp.gov.py">afranco@bcp.gov.py</a> and <a href="hcentu@bcp.gov.py">hcentu@bcp.gov.py</a></li>
<li><a href="mailto:jjimenez@bcp.gov.py">jjimenez@bcp.gov.py</a></li>
<li><a href="mailto:hcolman@bcp.gov.py">hcolman@bcp.gov.py</a></li>
</ul>
</div>
</div>
</div>
</div>
</div>
I want to get the content of every element with class="question" and class="answer", starting from <div id="participant">. I have many divs with the same structure and CSS, so the id is the only way to tell them apart.
This is my expected output:
Organisation              Ministerio de Hacienda [Ministry of Finance]
                          Consejo de Contadores Públicos del Paraguay (Consejo) [Council of Public Accountants of Paraguay]
                          Central Bank of Paraguay – Superintendence of Banks
Role of the organisation  The Ministry of Finance has authority to establish accounting standards for all entities in Paraguay other than banks and financial institutions.
                          The Consejo is the professional association of public accountants in Paraguay. The Consejo advises the Ministry of Finance with regard to accounting standards.
                          Accounting standards for banks and other financial institutions are established by the Central Bank of Paraguay.
Website                   Ministry of Finance: http://www.hacienda.gov.py
                          Consejo: www.consejo.com.py
                          Central Bank: http://www/bcp.gov.py
Email contact             Consejo: consejo@consejo.com.py
                          Central Bank:
                          afranco@bcp.gov.py and hcentu@bcp.gov.py
                          jjimenez@bcp.gov.py
                          hcolman@bcp.gov.py
This is my work so far:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
# Site URL
url = "https://www.ifrs.org/use-around-the-world/use-of-ifrs-standards-by-jurisdiction/paraguay"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse the HTML of the whole page
soup = BeautifulSoup(html_content, "lxml")
# Every div with id="participant"
divs = soup.find_all("div", attrs={"id": "participant"})
disp = []
d = []
# find() returns only the first question-group inside each div
for c in divs:
    disp.append(c.find("div", attrs={"class": "question-group"}))
# Collect the heading text of each question-group found
for t in disp:
    d.append(t.h5.text.strip())
Answer #1:
Setting aside the final print formatting, something like this should work:
questions = [q.text.strip() for q in soup.select('div#participant h5.question')]
answers = [a.text.strip() for a in soup.select('div#participant div.answer')]
for q, a in zip(questions, answers):
    print(q, ": ", a)
    print('---')
Output:
Organisation : Ministerio de Hacienda [Ministry of Finance]
Consejo de Contadores Públicos del Paraguay (Consejo) [Council of Public Accountants of Paraguay]
Central Bank of Paraguay – Superintendence of Banks
---
etc.
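If the final formatting matters, one possible variation is to walk each question-group and read its heading and answer together, so a group that happens to lack an answer cannot shift the pairing the way two independently zipped lists could. The sketch below is only an illustration under a few assumptions: the names SAMPLE_HTML, extract_pairs and format_pairs are made up for this example, the inline markup is a trimmed copy of the structure from the question, the answers are assumed to consist of non-nested p and li blocks, and "html.parser" is used instead of "lxml" only so the snippet runs without the extra dependency.

```python
from bs4 import BeautifulSoup

# Trimmed sample of the markup from the question (assumption: the real page
# follows the same question-group / question / answer structure).
SAMPLE_HTML = """
<div id="participant">
  <div class="question-group">
    <h5 class="question">
      Organisation
    </h5>
    <div class="answer">
      <p>Ministerio de Hacienda [Ministry of Finance]</p>
      <br>
    </div>
  </div>
  <div class="question-group">
    <h5 class="question">
      Website
    </h5>
    <div class="answer">
      <p>Ministry of Finance: <a href="http://www.hacienda.gov.py" target="_blank">http://www.hacienda.gov.py</a></p>
      <p>Consejo: <a href="http://www.consejo.com.py" target="_blank">www.consejo.com.py</a></p>
    </div>
  </div>
  <div class="question-group">
    <h5 class="question">
      Email contact
    </h5>
    <div class="answer">
      <p>Consejo: <a href="mailto:consejo@consejo.com.py">consejo@consejo.com.py</a><br>
      Central Bank:
      </p>
      <ul>
        <li><a href="mailto:jjimenez@bcp.gov.py">jjimenez@bcp.gov.py</a></li>
      </ul>
    </div>
  </div>
</div>
<div id="other">
  <div class="question-group">
    <h5 class="question">Ignored</h5>
    <div class="answer"><p>Belongs to a different section</p></div>
  </div>
</div>
"""


def extract_pairs(html, section_id):
    """Return [(question, [answer lines])] for the div with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for group in soup.select(f"div#{section_id} div.question-group"):
        question = group.select_one("h5.question")
        answer = group.select_one("div.answer")
        if question is None or answer is None:
            continue  # skip groups that lack either part
        # Turn <br> into a real line break so "a<br>b" becomes two lines
        for br in answer.find_all("br"):
            br.replace_with("\n")
        lines = []
        # Assumption: answers are made of non-nested <p> and <li> blocks
        for block in answer.find_all(["p", "li"]):
            for line in block.get_text().splitlines():
                if line.strip():
                    lines.append(line.strip())
        # Collapse the whitespace around the heading text
        pairs.append((" ".join(question.get_text().split()), lines))
    return pairs


def format_pairs(pairs):
    """Render pairs as two columns: the label once, answer lines aligned."""
    width = max((len(q) for q, _ in pairs), default=0) + 2
    out = []
    for question, lines in pairs:
        for i, line in enumerate(lines or [""]):
            label = question if i == 0 else ""
            out.append(f"{label:<{width}}{line}".rstrip())
    return "\n".join(out)


if __name__ == "__main__":
    print(format_pairs(extract_pairs(SAMPLE_HTML, "participant")))
```

For the live page, the same two functions could be fed requests.get(url).text in place of SAMPLE_HTML. Whether the real markup matches the sample closely enough is something to check against the actual response.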