#python #web-scraping #beautifulsoup
Question:
So, I am trying to parse the following web page: [this][1]. The page is structured as a tree, which I have fully expanded. I have managed to open the expanded tree with BeautifulSoup as follows:
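from bs4 import BeautifulSoup
# website_url is assumed to hold the HTML of the expanded page, not the URL itself
soup = []
soup.append(BeautifulSoup(website_url, 'lxml'))
# print(soup[0].prettify())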
But once I saw the HTML structure, I was quite puzzled about how to proceed. Ideally, my goal is to scrape the name at the last leaf of the tree and the code directly above it. For example, if you go to the linked web page and expand the tree along A -> A02 -> A02B -> A02BC -> A02BC01 -> Losec and associated names (referral), I would like to scrape "Losec and associated names (referral)" and A02BC01, and then do the same for the rest of the tree.
The HTML structure looks like this:
<html lang="en"><head>
<link type="image/vnd.microsoft.icon" rel="shortcut icon" href="https://ec.europa.eu/health/sites/health/themes/health/favicon.ico">
<meta http-equiv="Cache-Control" content="NO-CACHE">
<meta http-equiv="Content-Language" content="en">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="creator" content="SANTE/DG/UNIT/B5">
<meta name="description" content="Union Register of medicinal products">
<meta name="date" content="15/09/2020">
<meta name="keywords" content="Public Health, European Commission, European Union, EU, Union Register, Medicinal products">
<meta name="reference" content="Union Register of medicinal products">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta property="og:description" content="Union Register of medicinal products">
<meta property="og:site_name" content="Union Register of medicinal products">
<meta property="og:title" content="Public Health - European Commission">
<meta property="og:type" content="website">
<meta property="robots" content="follow,index">
<meta property="revisit-after" content="15 Days">
<title>Union Register of medicinal products - Public health - European Commission</title>
<link rel="stylesheet" type="text/css" href="../datatables/datatables.min.css">
<link rel="stylesheet" type="text/css" href="../jstree/themes/default/style.min.css">
<script type="text/javascript" src="../js/jquery-3.3.1.min.js"></script>
<script type="text/javascript" src="../jstree/jstree.min.js"></script>
<script type="text/javascript" src="../js/register_common.js"></script>
<script type="text/javascript">
// Initialize script parameters.
var exportTitle ="Centralised medicinal products for human use by ATC code";
// Initialise the dataset.
var dataSet = [
{"id":"A","parent":"#","text":"A - Alimentary tract and metabolism"},
{"id":"A02","parent":"A","text":"A02 - Drugs for acid related disorders"},
{"id":"A02B","parent":"A02","text":"A02B - Drugs for treatment of peptic ulcer"},
{"id":"A02BC","parent":"A02B","text":"A02BC - Proton pump inhibitors"},
{"id":"A02BC01","parent":"A02BC","text":"A02BC01 - omeprazole"},
{"id":"ho15861","parent":"A02BC01","text":"Losec and associated names (referral)","type":"pl"},
{"id":"A02BC02","parent":"A02BC","text":"A02BC02 - pantoprazole"},
{"id":"h515","parent":"A02BC02","text":"CONTROLOC Control (active)","type":"pl"},
{"id":"h518","parent":"A02BC02","text":"PANTECTA Control (withdrawn)","type":"pl"},
{"id":"h519","parent":"A02BC02","text":"PANTOLOC Control (active)","type":"pl"},
{"id":"h517","parent":"A02BC02","text":"PANTOZOL Control (active)","type":"pl"},
{"id":"ho15744","parent":"A02BC02","text":"Pantoprazole Bluefish (referral)","type":"pl"},
and it continues:
</script>
<script type="text/javascript" src="../js/tree_atc.js"></script>
<style type="text/css" media="all">
@import url("../css/health.css");
@import url("../css/europa.css");
@import url("../css/register.css");
</style>
</head>
<body>
<!--
First header section including the main topic selection, the EU logo and the search form.
-->
<header class="ecl-site-header" role="banner">
<div class="ecl-site-switcher ecl-site-switcher--header">
<ul id="ecl-site-switcher-header" class="ecl-site-switcher__list ecl-container">
<li class="ecl-site-switcher__option first"><a href="https://ec.europa.eu/commission/index_en" class="ecl-site-switcher__link ecl-link">Commission and its priorities</a></li>
<li class="ecl-site-switcher__option last ecl-site-switcher__option--is-selected"> <a href="https://ec.europa.eu/info/index_en" class="ecl-site-switcher__link ecl-link">Policies, information and services</a></li>
</ul>
</div>
<div class="ecl-container ecl-site-header__banner">
<a href="https://ec.europa.eu" class="ecl-logo" title="European Commission">
<span class="ecl-u-sr-only">European Commission</span>
</a>
</div>
</header>
<!--
Second header section including the limited breadcrumb and the page main title.
-->
<div class="ecl-page-header">
<div class="ecl-container">
<nav class="ecl-breadcrumbs" aria-label="breadcrumbs">
<div class="item-list">
<ol class="ecl-breadcrumbs__segments-wrapper">
<li class="ecl-breadcrumbs__segment ecl-breadcrumbs__segment--first first"><a href="https://ec.europa.eu/index_en.htm" class="ecl-breadcrumbs__link">European Commission</a></li>
<li class="ecl-breadcrumbs__segment last"><a href="https://ec.europa.eu/info/live-work-travel-eu_en" class="ecl-breadcrumbs__link">Live, work, travel in the EU</a></li>
</ol>
</div>
</nav>
<div class="ecl-page-header__body">
<div class="support"><button class="buttonSupport" title="Contact Union Register support">Union Register support</button></div>
<div class="ecl-page-header__identity">Public Health - Union Register of medicinal products</div>
</div>
</div>
</div>
<!--
Main content section. This is where the Union Register specific content will be displayed.
-->
<div id="mainContainer1" class="health-mainContainer">
<div class="wrapper-fixed">
<div class="ecl-container">
<h1 id="content-title">Centralised medicinal products for human use by ATC code</h1>
<button class="dt-button dtButtonHome" title="Back to Union Register homepage" id="home"><span></span></button>
<button class="dt-button dtButtonExpandAll" title="Expand entire list of ATC codes" id="expand_all"><span></span></button>
<button class="dt-button dtButtonCollapseAll" title="Collapse entire list of ATC codes" id="collapse_all"><span></span></button>
<div id="jstree_atc" class="jstree jstree-1 jstree-default" role="tree" aria-multiselectable="true" tabindex="0" aria-activedescendant="h941" aria-busy="false"><ul class="jstree-container-ul jstree-children jstree-no-dots" role="group"><li role="treeitem" aria-selected="false" aria-level="1" aria-labelledby="A_anchor" aria-expanded="true" id="A" class="jstree-node jstree-open"><i class="jstree-icon jstree-ocl" role="presentation"></i><a class="jstree-anchor" href="#" tabindex="-1" id="A_anchor"><i class="jstree-icon jstree-themeicon jstree-themeicon-hidden" role="presentation"></i>A - Alimentary tract and metabolism</a><ul role="group" class="jstree-children"><li role="treeitem" aria-selected="false" aria-level="2" aria-labelledby="A02_anchor" aria-expanded="true" id="A02" class="jstree-node jstree-open"><i class="jstree-icon jstree-ocl" role="presentation"></i><a class="jstree-anchor" href="#" tabindex="-1" id="A02_anchor"><i class="jstree-icon jstree-themeicon jstree-themeicon-hidden" role="presentation"></i>A02 - Drugs for acid related disorders</a><ul role="group" class="jstree-children"><li role="treeitem" aria-selected="false" aria-level="3" aria-labelledby="A02B_anchor" aria-expanded="true" id="A02B" class="jstree-node jstree-open jstree-last"><i class="jstree-icon jstree-ocl" role="presentation"></i><a class="jstree-anchor" href="#" tabindex="-1" id="A02B_anchor"><i class="jstree-icon jstree-themeicon jstree-themeicon-hidden" role="presentation"></i>A02B - Drugs for treatment of peptic ulcer</a><ul role="group" class="jstree-children"><li role="treeitem" aria-selected="false" aria-level="4" aria-labelledby="A02BC_anchor" aria-expanded="true" id="A02BC" class="jstree-node jstree-open jstree-last"><i class="jstree-icon jstree-ocl" role="presentation"></i><a class="jstree-anchor" href="#" tabindex="-1" id="A02BC_anchor"><i class="jstree-icon jstree-themeicon jstree-themeicon-hidden" role="presentation"></i>A02BC - Proton pump inhibitors</a><ul role="group" 
class="jstree-children"><li role="treeitem" aria-selected="false" aria-level="5" aria-labelledby="A02BC01_anchor" aria-expanded="true" id="A02BC01" class="jstree-node jstree-open"><i class="jstree-icon jstree-ocl" role="presentation"></i><a class="jstree-anchor" href="#" tabindex="-1" id="A02BC01_anchor"><i class="jstree-icon jstree-themeicon jstree-themeicon-hidden" role="presentation"></i>A02BC01 - omeprazole</a><ul role="group" class="jstree-children"><li role="treeitem" aria-selected="false" aria-level="6" aria-labelledby="ho15861_anchor" id="ho15861" class="jstree-node jstree-leaf jstree-last"><i class="jstree-icon jstree-ocl" role="presentation"></i><a class="jstree-anchor" href="#" tabindex="-1" id="ho15861_anchor"><i class="jstree-icon jstree-themeicon jstree-themeicon-custom" role="presentation" style="background-image: url(amp;quot;https://ec.europa.eu/health/documents/community-register/images/logo_file.pngamp;quot;); background-size: auto; background-position: center center;"></i>Losec and associated names (referral)</a></li></ul></li><li role="treeitem" aria-selected="false" aria-level="5" aria-labelledby="A02BC02_anchor" aria-expanded="true" id="A02BC02" class="jstree-node jstree-open"><i class="jstree-icon jstree-ocl" role="presentation"></i><a class="jstree-anchor" href="#" tabindex="-1" id="A02BC02_anchor"><i class="jstree-icon jstree-themeicon jstree-themeicon-hidden" role="presentation"></i>A02BC02 - pantoprazole</a><ul role="group" class="jstree-children"><li role="treeitem" aria-selected="false" aria-level="6" aria-labelledby="h515_anchor" id="h515" class="jstree-node jstree-leaf"><i class="jstree-icon jstree-ocl" role="presentation"></i><a class="jstree-anchor" href="#" tabindex="-1" id="h515_anchor"><i class="jstree-icon jstree-themeicon jstree-themeicon-custom" role="presentation" style="background-image: url(amp;quot;https://ec.europa.eu/health/documents/community-register/images/logo_file.pngamp;quot;); background-size: auto; 
background-position: center center;"></i>CONTROLOC Control (active)</a></li><li role="treeitem" aria-selected="false" aria-level="6" aria-labelledby="h518_anchor" id="h518" class="jstree-node jstree-leaf"><i class="jstree-icon jstree-ocl" role="presentation"></i><a class="jstree-anchor" href="#" tabindex="-1" id="h518_anchor"><i class="jstree-icon jstree-themeicon jstree-themeicon-custom" role="presentation" style="background-image: url(amp;quot;https://ec.europa.eu/health/documents/community-register/images/logo_file.pngamp;quot;); background-size: auto; background-position: center center;"></i>PANTECTA Control (withdrawn)</a></li><li role="treeitem" aria-selected="false" aria-level="6" aria-labelledby="h519_anchor" id="h519" class="jstree-node jstree-leaf"><i class="jstree-icon jstree-ocl" role="presentation"></i><a class="jstree-anchor" href="#" tabindex="-1" id="h519_anchor"><i class="jstree-icon jstree-themeicon jstree-themeicon-custom" role="presentation" style="background-image: url(amp;quot;https://ec.europa.eu/health/documents/community-register/images/logo_file.pngamp;quot;); background-size: auto; background-position: center center;"></i>PANTOLOC Control (active)</a></li><li role="treeitem" aria-selected="false" aria-level="6" aria-labelledby="h517_anchor" id="h517" class="jstree-node jstree-leaf"><i class="jstree-icon jstree-ocl" role="presentation"></i><a class="jstree-anchor" href="#" tabindex="-1" id="h517_anchor"><i class="jstree-icon jstree-themeicon jstree-themeicon-custom" role="presentation" style="background-image: url(amp;quot;https://ec.europa.eu/health/documents/community-register/images/logo_file.pngamp;quot;); background-size: auto; background-position: center center;"></i>PANTOZOL Control (active)</a></li><li role="treeitem" aria-selected="false" aria-level="6" aria-labelledby="ho15744_anchor" id="ho15744" class="jstree-node jstree-leaf"><i class="jstree-icon jstree-ocl" role="presentation"></i><a class="jstree-anchor" href="#" 
tabindex="-1" id="ho15744_anchor"><i class="jstree-icon jstree-themeicon jstree-themeicon-custom" role="presentation" style="background-image: url(amp;quot;https://ec.europa.eu/health/documents/community-register/images/logo_file.pngamp;quot;); background-size: auto; background-position: center center;"></i>Pantoprazole Bluefish (referral)
Is there a way to scrape what I am after?
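One possible way to read the leaves from the expanded markup is sketched below. It assumes the HTML handed to BeautifulSoup already contains the rendered jstree nodes (as in the dump above), that every product is an `li` with the class `jstree-leaf`, and that the ATC code is the `id` of the nearest enclosing `li.jstree-node`. The sample markup and the helper name `leaf_rows` are hypothetical and trimmed down for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down sample of the expanded jstree markup.
html = """
<div id="jstree_atc">
<ul>
  <li id="A02BC01" class="jstree-node jstree-open">
    <a class="jstree-anchor" id="A02BC01_anchor">A02BC01 - omeprazole</a>
    <ul>
      <li id="ho15861" class="jstree-node jstree-leaf jstree-last">
        <a class="jstree-anchor" id="ho15861_anchor"><i class="jstree-icon"></i>Losec and associated names (referral)</a>
      </li>
    </ul>
  </li>
</ul>
</div>
"""


def leaf_rows(markup):
    """Return (parent_code, leaf_name) pairs for every jstree leaf."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for leaf in soup.select("li.jstree-leaf"):
        # The visible name is the text of the leaf's own anchor.
        name = leaf.find("a", class_="jstree-anchor").get_text(strip=True)
        # The code is taken from the id of the closest enclosing tree node.
        parent = leaf.find_parent("li", class_="jstree-node")
        code = parent["id"] if parent is not None else None
        rows.append((code, name))
    return rows


rows = leaf_rows(html)
for code, name in rows:
    print(code, "|", name)
```

Note that this only works on the rendered DOM (for example, HTML saved from the browser after expanding the tree). A plain HTTP fetch of the page would most likely return only the `dataSet` script, because the tree appears to be built client-side by jstree.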
Comments:
1. Could it be that the first structure is written in Java, and the second part of the HTML is there to indicate where the Java-written code sits on the page? When inspecting the leaves, I get:
<a class="jstree-anchor" href="#" tabindex="-1" id="ho15861_anchor"><i class="jstree-icon jstree-themeicon jstree-themeicon-custom" role="presentation" style="background-image: url(&quot;https://ec.europa.eu/health/documents/community-register/images/logo_file.png&quot;); background-size: auto; background-position: center center;"></i>Losec and associated names (referral)</a>
2. Do you have a URL you can provide? It is a bit easier to test against the actual result. From the information provided, you need to look at the "text" field: if it contains "(referral)", print the whole line.
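The suggestion in the second comment can be sketched roughly as follows, working on the `dataSet` array embedded in the page's script instead of the rendered tree. This is only a sketch under assumptions: that the array is closed with `];`, that its contents are valid JSON, and that product entries are the ones carrying `"type":"pl"` (the excerpt is truncated, so none of this is confirmed). The sample string and the helper name `products_by_code` are hypothetical.

```python
import json
import re

# Hypothetical excerpt of the page source; the real array is much longer.
page_source = '''
<script type="text/javascript">
var exportTitle ="Centralised medicinal products for human use by ATC code";
var dataSet = [
{"id":"A02BC","parent":"A02B","text":"A02BC - Proton pump inhibitors"},
{"id":"A02BC01","parent":"A02BC","text":"A02BC01 - omeprazole"},
{"id":"ho15861","parent":"A02BC01","text":"Losec and associated names (referral)","type":"pl"},
{"id":"A02BC02","parent":"A02BC","text":"A02BC02 - pantoprazole"},
{"id":"h515","parent":"A02BC02","text":"CONTROLOC Control (active)","type":"pl"}
];
</script>
'''


def products_by_code(source):
    """Return (atc_code, product_name) pairs parsed from the dataSet array."""
    # Assumption: the array literal ends with "];" and is valid JSON.
    match = re.search(r"var\s+dataSet\s*=\s*(\[.*?\])\s*;", source, re.DOTALL)
    if match is None:
        return []
    nodes = json.loads(match.group(1))
    # Assumption: entries with "type": "pl" are the product leaves, and
    # their "parent" field is the ATC code directly above them.
    return [(n["parent"], n["text"]) for n in nodes if n.get("type") == "pl"]


for code, name in products_by_code(page_source):
    print(code, "|", name)
```

To keep only referrals, as the comment proposes, the filter could additionally check `"(referral)" in n["text"]`. Reading `dataSet` avoids depending on the expanded DOM at all, which may matter if the page is fetched without a browser.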