#python #html #beautifulsoup
Question:
I have an old HTML document:
<h1>Health Authority Updates</h1><h2>North America</h2><h3><a id="_US_guidances/regulations"></a>US
guidances/regulations</h3>
<ol>
<li>Final Guidance: 25-May-2021: <a
href="https://www.fda.gov/regulatory-information/search-fda-guidance-documents/emergency-use-authorization-vaccines-prevent-covid-19">Emergency
Use Authorization for Vaccines to Prevent COVID-19: Guidance for Industry</a>
<ol>
<li>abc</li>
<li>def</li>
</ol>
</li>
</ol><h2>Asia-Pacific </h2><h3><a id="_Australia_guidances/regulations"></a>Australia guidances/regulations</h3>
<ol>
<li>Guidance: 04-Sep-2020: <a href="https://www.cortellis.com/intelligence/report/ri/regulatory/238041">Cortellis
Report on In Vitro Diagnostics Regulatory Framework</a>
<ol>
<li>This Regulatory Summary is related to specific Regulation for In Vitro Diagnostics in Australia. It
provides definitions and outlines legal framework from different points of view (manufacturers,
importers and distributors). It gives information about Registration procedures, provides practical help
on how to obtain its notification. This document also contains detailed information about fees, clinical
trials, post-marketing vigilance system, labeling, pricing and reimbursement and advertising.
</li>
<li>Content Update on <strong>04-Sep-2020</strong>:
<ol>
<li>One</li>
<li>Two</li>
<li>three</li>
</ol>
</li>
</ol>
</li>
</ol>
And this is the new HTML:
<h2>North America</h2><h3>US guidances/regulations</h3>
<ol>
<li>2021-06-22:<a href=http://www.minsa.gob.pa/noticia/arranca-esperado-proceso-de-vacunacion-en-chiriqui> Emergency
Use Authorization for Vaccines to Prevent weweCOVID-19: Guidance for Industry 22</a>
<ol>
<li> first list</li>
<li> Second</li>
</ol>
</li>
</ol><h2>Asia Pacific</h2><h3>Australia guidances/regulations</h3>
<ol>
<li>2021-06-22:<a href=http://www.minsa.gob.pa/noticia/arranca-esperado-proceso-de-vacunacion-en-chiriqui> Emergency
Use Authorization for Vaccines to Prevent weweCOVID-19: Guidance for Industry 22</a>
<ol>
<li> first list</li>
<li> Second</li>
</ol>
</li>
</ol>
I need to insert the items listed under "US guidances/regulations" in the second HTML at the top of the "US guidances/regulations" list in the first HTML, and likewise for Australia. Here is my code:
from bs4 import BeautifulSoup

soup1 = BeautifulSoup(html_string, "html.parser")
soup2 = BeautifulSoup(html_string_new, "html.parser")

for li in soup2.select("h3 + ol > li"):
    h3_text = li.find_previous("h3").get_text(strip=True)
    h3_soup1 = soup1.find("h3")
    if not h3_soup1:
        continue
    h3_soup1.find_next("ol").insert(0, li)
The problem is that it inserts everything under the US heading, like this:
<h1>Health Authority Updates</h1><h2>North America</h2><h3><a id="_US_guidances/regulations"></a>US
guidances/regulations</h3>
<ol>
<li>2021-06-22:<a href="http://www.minsa.gob.pa/noticia/arranca-esperado-proceso-de-vacunacion-en-chiriqui">
Emergency Use Authorization for Vaccines to Prevent weweCOVID-19: Guidance for Industry 22</a>
<ol>
<li> first list</li>
<li> Second</li>
</ol>
</li>
<li>2021-06-22:<a href="http://www.minsa.gob.pa/noticia/arranca-esperado-proceso-de-vacunacion-en-chiriqui">
Emergency Use Authorization for Vaccines to Prevent weweCOVID-19: Guidance for Industry 22</a>
<ol>
<li> first list</li>
<li> Second</li>
</ol>
</li>
<li>Final Guidance: 25-May-2021: <a
href="https://www.fda.gov/regulatory-information/search-fda-guidance-documents/emergency-use-authorization-vaccines-prevent-covid-19">Emergency
Use Authorization for Vaccines to Prevent COVID-19: Guidance for Industry</a>
<ol>
<li>abc</li>
<li>def</li>
</ol>
</li>
</ol><h2>Asia-Pacific </h2><h3><a id="_Australia_guidances/regulations"></a>Australia guidances/regulations</h3>
<ol>
<li>Guidance: 04-Sep-2020: <a href="https://www.cortellis.com/intelligence/report/ri/regulatory/238041">Cortellis
Report on In Vitro Diagnostics Regulatory Framework</a>
<ol>
<li>This Regulatory Summary is related to specific Regulation for In Vitro Diagnostics in Australia. It
provides definitions and outlines legal framework from different points of view (manufacturers,
importers and distributors). It gives information about Registration procedures, provides practical help
on how to obtain its notification. This document also contains detailed information about fees, clinical
trials, post-marketing vigilance system, labeling, pricing and reimbursement and advertising.
</li>
<li>Content Update on <strong>04-Sep-2020</strong>:
<ol>
<li>One</li>
<li>Two</li>
<li>three</li>
</ol>
</li>
</ol>
</li>
</ol>
I tried replacing h3_soup1 = soup1.find("h3")
with h3_soup1 = soup1.find("h3", text=h3_text),
but it returns None.
Edit:
Expected output:
<h1>Health Authority Updates</h1><h2>North America</h2><h3><a id="_US_guidances/regulations"></a>US
guidances/regulations</h3>
<ol>
<li>2021-06-22:<a href="http://www.minsa.gob.pa/noticia/arranca-esperado-proceso-de-vacunacion-en-chiriqui">
Emergency Use Authorization for Vaccines to Prevent weweCOVID-19: Guidance for Industry 22</a>
<ol>
<li> first list</li>
<li> Second</li>
</ol>
</li>
<li>Final Guidance: 25-May-2021: <a
href="https://www.fda.gov/regulatory-information/search-fda-guidance-documents/emergency-use-authorization-vaccines-prevent-covid-19">Emergency
Use Authorization for Vaccines to Prevent COVID-19: Guidance for Industry</a>
<ol>
<li>abc</li>
<li>def</li>
</ol>
</li>
</ol><h2>Asia-Pacific </h2><h3><a id="_Australia_guidances/regulations"></a>Australia guidances/regulations</h3>
<ol>
<li>2021-06-22:<a href="http://www.minsa.gob.pa/noticia/arranca-esperado-proceso-de-vacunacion-en-chiriqui">
Emergency Use Authorization for Vaccines to Prevent weweCOVID-19: Guidance for Industry 22</a>
<ol>
<li> first list</li>
<li> Second</li>
</ol>
</li>
<li>Guidance: 04-Sep-2020: <a href="https://www.cortellis.com/intelligence/report/ri/regulatory/238041">Cortellis
Report on In Vitro Diagnostics Regulatory Framework</a>
<ol>
<li>This Regulatory Summary is related to specific Regulation for In Vitro Diagnostics in Australia. It
provides definitions and outlines legal framework from different points of view (manufacturers,
importers and distributors). It gives information about Registration procedures, provides practical help
on how to obtain its notification. This document also contains detailed information about fees, clinical
trials, post-marketing vigilance system, labeling, pricing and reimbursement and advertising.
</li>
<li>Content Update on <strong>04-Sep-2020</strong>:
<ol>
<li>One</li>
<li>Two</li>
<li>three</li>
</ol>
</li>
</ol>
</li>
</ol>
Comments:
1. Please show us the expected output.
Answer #1:
Try:
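import re
from bs4 import BeautifulSoup

soup1 = BeautifulSoup(html_string, "html.parser")
soup2 = BeautifulSoup(html_string_new, "html.parser")

def fn(txt, tag):
    # Match only <h3> tags whose whitespace-normalized text contains txt.
    if tag.name != "h3":
        return False
    # Collapse whitespace runs (including newlines) into a single space.
    t = re.sub(r"\s+", " ", tag.get_text(strip=True))
    return txt in t

for li in soup2.select("h3 + ol > li"):
    h3_text = li.find_previous("h3").get_text(strip=True)
    h3_soup1 = soup1.find(lambda t: fn(h3_text, t))
    if not h3_soup1:
        continue
    # Move the <li> to the top of the matching list in the old document.
    h3_soup1.find_next("ol").insert(0, li)

print(soup1)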
Prints:
<h1>Health Authority Updates</h1><h2>North America</h2><h3><a id="_US_guidances/regulations"></a>US
guidances/regulations</h3>
<ol><li>2021-06-22:<a href="http://www.minsa.gob.pa/noticia/arranca-esperado-proceso-de-vacunacion-en-chiriqui"> Emergency
Use Authorization for Vaccines to Prevent weweCOVID-19: Guidance for Industry 22</a>
<ol>
<li> first list</li>
<li> Second</li>
</ol>
</li>
<li>Final Guidance: 25-May-2021: <a href="https://www.fda.gov/regulatory-information/search-fda-guidance-documents/emergency-use-authorization-vaccines-prevent-covid-19">Emergency
Use Authorization for Vaccines to Prevent COVID-19: Guidance for Industry</a>
<ol>
<li>abc</li>
<li>def</li>
</ol>
</li>
</ol><h2>Asia-Pacific </h2><h3><a id="_Australia_guidances/regulations"></a>Australia guidances/regulations</h3>
<ol><li>2021-06-22:<a href="http://www.minsa.gob.pa/noticia/arranca-esperado-proceso-de-vacunacion-en-chiriqui"> Emergency
Use Authorization for Vaccines to Prevent weweCOVID-19: Guidance for Industry 22</a>
<ol>
<li> first list</li>
<li> Second</li>
</ol>
</li>
<li>Guidance: 04-Sep-2020: <a href="https://www.cortellis.com/intelligence/report/ri/regulatory/238041">Cortellis
Report on In Vitro Diagnostics Regulatory Framework</a>
<ol>
<li>This Regulatory Summary is related to specific Regulation for In Vitro Diagnostics in Australia. It
provides definitions and outlines legal framework from different points of view (manufacturers,
importers and distributors). It gives information about Registration procedures, provides practical help
on how to obtain its notification. This document also contains detailed information about fees, clinical
trials, post-marketing vigilance system, labeling, pricing and reimbursement and advertising.
</li>
<li>Content Update on <strong>04-Sep-2020</strong>:
<ol>
<li>One</li>
<li>Two</li>
<li>three</li>
</ol>
</li>
</ol>
</li>
</ol>
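For anyone who wants to run the idea without the full documents, here is a minimal self-contained sketch of the same approach. The shortened HTML strings and the names norm, targets, and merged are illustrative assumptions, not part of the original question. It indexes the old document's lists by normalized heading text and, as one possible variation, inserts the new items in reverse so that their original order is kept when a section has more than one.

```python
import re
from bs4 import BeautifulSoup

# Shortened stand-ins for the old and new documents (hypothetical data).
old_html = """<h3><a id="_US"></a>US
guidances/regulations</h3>
<ol><li>old US item</li></ol>
<h3><a id="_AU"></a>Australia guidances/regulations</h3>
<ol><li>old AU item</li></ol>"""

new_html = """<h3>US guidances/regulations</h3>
<ol><li>new US item</li></ol>
<h3>Australia guidances/regulations</h3>
<ol><li>new AU item</li></ol>"""


def norm(text):
    """Collapse every whitespace run (including newlines) to one space."""
    return re.sub(r"\s+", " ", text).strip()


soup1 = BeautifulSoup(old_html, "html.parser")
soup2 = BeautifulSoup(new_html, "html.parser")

# Map each normalized <h3> text in the old document to the <ol> that follows it.
targets = {
    norm(h3.get_text()): h3.find_next_sibling("ol")
    for h3 in soup1.find_all("h3")
}

for h3 in soup2.find_all("h3"):
    target = targets.get(norm(h3.get_text()))
    source = h3.find_next_sibling("ol")
    if target is None or source is None:
        continue
    # Insert in reverse so the new items keep their order at the top.
    for li in reversed(source.find_all("li", recursive=False)):
        target.insert(0, li)

# Top-level item texts per list, for a quick check of the result.
merged = [
    [li.get_text(strip=True) for li in ol.find_all("li", recursive=False)]
    for ol in soup1.find_all("ol")
]
print(merged)
```

Matching on exact normalized text rather than substring containment is a design choice here; it avoids one heading accidentally matching another that merely contains it.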
Comments:
1. Please explain your logic. Why didn't soup.find("h3", text=...) work?
2. @hannahmontanna If you look inside soup1, the text inside <h3> is split across several lines. So to compare it with the <h3> text from soup2, I first need to remove those newlines. That is easier to do in an explicit function.
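The point in the comment above can be checked on a single heading. The snippet below is a small sketch using a one-line copy of the US <h3> from the old document; the variable names are illustrative. Besides the newline, it also shows a second likely reason the direct lookup fails: the <h3> contains an <a> tag next to its text, so its .string attribute is None and a string filter on the tag has nothing to compare against.

```python
from bs4 import BeautifulSoup

# The US heading from the old document: an empty <a> plus text with a newline.
html = '<h3><a id="_US_guidances/regulations"></a>US\nguidances/regulations</h3>'
soup = BeautifulSoup(html, "html.parser")
h3 = soup.find("h3")

# <h3> has two children (the <a> tag and a text node), so .string is None.
has_no_string = h3.string is None

# A direct lookup by string therefore finds nothing.
direct = soup.find("h3", string="US guidances/regulations")

# get_text(strip=True) trims the ends but keeps the inner newline.
raw = h3.get_text(strip=True)

# Splitting on whitespace and re-joining yields a comparable one-line string.
normalized = " ".join(h3.get_text().split())

print(has_no_string, direct, repr(raw), repr(normalized))
```

Note that string= is the current name of the argument the question passes as text=; both select on the same tag property.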