Parsing a web page with a tree structure

#python #web-scraping #beautifulsoup

Question:

I am trying to parse the following web page: [this] [1]. The page is structured as a tree, which I have fully expanded. I managed to load the expanded tree with BeautifulSoup as follows:

    from bs4 import BeautifulSoup

    # website_url is assumed to hold the HTML of the expanded page
    soup = []
    soup.append(BeautifulSoup(website_url, 'lxml'))
    # print(soup[0].prettify())
  

but once I saw the HTML structure, I was quite puzzled about how to proceed. Ideally, my goal is to scrape the name at the last leaf of the tree and the code right above it. For example, if you go to the linked web page and expand the tree along A -> A02 -> A02B -> A02BC -> A02BC01 -> Losec and associated names (referral), I would like to scrape "Losec and associated names (referral)" and A02BC01, and of course do the same for the rest of the tree.

The HTML structure looks like this:

 <html lang="en"><head>
        <link type="image/vnd.microsoft.icon" rel="shortcut icon" href="https://ec.europa.eu/health/sites/health/themes/health/favicon.ico">
        <meta http-equiv="Cache-Control" content="NO-CACHE">
        <meta http-equiv="Content-Language" content="en">
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
        <meta name="creator" content="SANTE/DG/UNIT/B5">
        <meta name="description" content="Union Register of medicinal products">
        <meta name="date" content="15/09/2020">
        <meta name="keywords" content="Public Health, European Commission, European Union, EU, Union Register, Medicinal products">
        <meta name="reference" content="Union Register of medicinal products">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <meta property="og:description" content="Union Register of medicinal products">
        <meta property="og:site_name" content="Union Register of medicinal products">
        <meta property="og:title" content="Public Health - European Commission">
        <meta property="og:type" content="website">
        <meta property="robots" content="follow,index">
        <meta property="revisit-after" content="15 Days">
        <title>Union Register of medicinal products - Public health - European Commission</title>
      
        <link rel="stylesheet" type="text/css" href="../datatables/datatables.min.css">
        <link rel="stylesheet" type="text/css" href="../jstree/themes/default/style.min.css">
        <script type="text/javascript" src="../js/jquery-3.3.1.min.js"></script>
        <script type="text/javascript" src="../jstree/jstree.min.js"></script>
        <script type="text/javascript" src="../js/register_common.js"></script>
        <script type="text/javascript">
            // Initialize script parameters.
            var exportTitle ="Centralised medicinal products for human use by ATC code";

            // Initialise the dataset.
            var dataSet = [
{"id":"A","parent":"#","text":"A - Alimentary tract and metabolism"},
{"id":"A02","parent":"A","text":"A02 - Drugs for acid related disorders"},
{"id":"A02B","parent":"A02","text":"A02B - Drugs for treatment of peptic ulcer"},
{"id":"A02BC","parent":"A02B","text":"A02BC - Proton pump inhibitors"},
{"id":"A02BC01","parent":"A02BC","text":"A02BC01 - omeprazole"},
{"id":"ho15861","parent":"A02BC01","text":"Losec and associated names (referral)","type":"pl"},
{"id":"A02BC02","parent":"A02BC","text":"A02BC02 - pantoprazole"},
{"id":"h515","parent":"A02BC02","text":"CONTROLOC Control (active)","type":"pl"},
{"id":"h518","parent":"A02BC02","text":"PANTECTA Control (withdrawn)","type":"pl"},
{"id":"h519","parent":"A02BC02","text":"PANTOLOC Control (active)","type":"pl"},
{"id":"h517","parent":"A02BC02","text":"PANTOZOL Control (active)","type":"pl"},
{"id":"ho15744","parent":"A02BC02","text":"Pantoprazole Bluefish (referral)","type":"pl"},

  

and continues:

 </script>
       <script type="text/javascript" src="../js/tree_atc.js"></script>
       
       <style type="text/css" media="all">
           @import url("../css/health.css");
           @import url("../css/europa.css");
           @import url("../css/register.css");
       </style>
   </head>

   <body>
       <!--
           First header section including the main topic selection, the EU logo and the search form.
       -->
       <header class="ecl-site-header" role="banner">
           <div class="ecl-site-switcher ecl-site-switcher--header">
               <ul id="ecl-site-switcher-header" class="ecl-site-switcher__list ecl-container">
                   <li class="ecl-site-switcher__option first"><a href="https://ec.europa.eu/commission/index_en" class="ecl-site-switcher__link ecl-link">Commission and its priorities</a></li>
                   <li class="ecl-site-switcher__option last ecl-site-switcher__option--is-selected">  <a href="https://ec.europa.eu/info/index_en" class="ecl-site-switcher__link ecl-link">Policies, information and services</a></li>
               </ul>
           </div>

           <div class="ecl-container ecl-site-header__banner">
               <a href="https://ec.europa.eu" class="ecl-logo" title="European Commission">
                   <span class="ecl-u-sr-only">European Commission</span>
               </a>
           </div>
       </header>

       <!--
           Second header section including the limited breadcrumb and the page main title.
       -->
       <div class="ecl-page-header">
           <div class="ecl-container">
               <nav class="ecl-breadcrumbs" aria-label="breadcrumbs">
                   <div class="item-list">
                       <ol class="ecl-breadcrumbs__segments-wrapper">
                           <li class="ecl-breadcrumbs__segment ecl-breadcrumbs__segment--first first"><a href="https://ec.europa.eu/index_en.htm" class="ecl-breadcrumbs__link">European Commission</a></li>
                           <li class="ecl-breadcrumbs__segment last"><a href="https://ec.europa.eu/info/live-work-travel-eu_en" class="ecl-breadcrumbs__link">Live, work, travel in the EU</a></li>
                       </ol>
                   </div>
               </nav>
               <div class="ecl-page-header__body">
                   <div class="support"><button class="buttonSupport" title="Contact Union Register support">Union Register support</button></div>                     
                   <div class="ecl-page-header__identity">Public Health - Union Register of medicinal products</div>
               </div>
           </div>
       </div>

       <!--
           Main content section. This is where the Union Register specific content will be displayed.
       -->
       <div id="mainContainer1" class="health-mainContainer">
           <div class="wrapper-fixed">
               <div class="ecl-container">
                   <h1 id="content-title">Centralised medicinal products for human use by ATC code</h1>
                   <button class="dt-button dtButtonHome" title="Back to Union Register homepage" id="home"><span></span></button>
                   <button class="dt-button dtButtonExpandAll" title="Expand entire list of ATC codes" id="expand_all"><span></span></button>
                   <button class="dt-button dtButtonCollapseAll" title="Collapse entire list of ATC codes" id="collapse_all"><span></span></button>
                   <div id="jstree_atc" class="jstree jstree-1 jstree-default" role="tree" aria-multiselectable="true" tabindex="0" aria-activedescendant="h941" aria-busy="false"><ul class="jstree-container-ul jstree-children jstree-no-dots" role="group"><li role="treeitem" aria-selected="false" aria-level="1" aria-labelledby="A_anchor" aria-expanded="true" id="A" class="jstree-node  jstree-open"><i class="jstree-icon jstree-ocl" role="presentation"></i><a class="jstree-anchor" href="#" tabindex="-1" id="A_anchor"><i class="jstree-icon jstree-themeicon jstree-themeicon-hidden" role="presentation"></i>A - Alimentary tract and metabolism</a><ul role="group" class="jstree-children"><li role="treeitem" aria-selected="false" aria-level="2" aria-labelledby="A02_anchor" aria-expanded="true" id="A02" class="jstree-node  jstree-open"><i class="jstree-icon jstree-ocl" role="presentation"></i><a class="jstree-anchor" href="#" tabindex="-1" id="A02_anchor"><i class="jstree-icon jstree-themeicon jstree-themeicon-hidden" role="presentation"></i>A02 - Drugs for acid related disorders</a><ul role="group" class="jstree-children"><li role="treeitem" aria-selected="false" aria-level="3" aria-labelledby="A02B_anchor" aria-expanded="true" id="A02B" class="jstree-node  jstree-open jstree-last"><i class="jstree-icon jstree-ocl" role="presentation"></i><a class="jstree-anchor" href="#" tabindex="-1" id="A02B_anchor"><i class="jstree-icon jstree-themeicon jstree-themeicon-hidden" role="presentation"></i>A02B - Drugs for treatment of peptic ulcer</a><ul role="group" class="jstree-children"><li role="treeitem" aria-selected="false" aria-level="4" aria-labelledby="A02BC_anchor" aria-expanded="true" id="A02BC" class="jstree-node  jstree-open jstree-last"><i class="jstree-icon jstree-ocl" role="presentation"></i><a class="jstree-anchor" href="#" tabindex="-1" id="A02BC_anchor"><i class="jstree-icon jstree-themeicon jstree-themeicon-hidden" role="presentation"></i>A02BC - Proton pump 
inhibitors</a><ul role="group" class="jstree-children"><li role="treeitem" aria-selected="false" aria-level="5" aria-labelledby="A02BC01_anchor" aria-expanded="true" id="A02BC01" class="jstree-node  jstree-open"><i class="jstree-icon jstree-ocl" role="presentation"></i><a class="jstree-anchor" href="#" tabindex="-1" id="A02BC01_anchor"><i class="jstree-icon jstree-themeicon jstree-themeicon-hidden" role="presentation"></i>A02BC01 - omeprazole</a><ul role="group" class="jstree-children"><li role="treeitem" aria-selected="false" aria-level="6" aria-labelledby="ho15861_anchor" id="ho15861" class="jstree-node  jstree-leaf jstree-last"><i class="jstree-icon jstree-ocl" role="presentation"></i><a class="jstree-anchor" href="#" tabindex="-1" id="ho15861_anchor"><i class="jstree-icon jstree-themeicon jstree-themeicon-custom" role="presentation" style="background-image: url(&quot;https://ec.europa.eu/health/documents/community-register/images/logo_file.png&quot;); background-size: auto; background-position: center center;"></i>Losec and associated names (referral)</a></li></ul></li><li role="treeitem" aria-selected="false" aria-level="5" aria-labelledby="A02BC02_anchor" aria-expanded="true" id="A02BC02" class="jstree-node  jstree-open"><i class="jstree-icon jstree-ocl" role="presentation"></i><a class="jstree-anchor" href="#" tabindex="-1" id="A02BC02_anchor"><i class="jstree-icon jstree-themeicon jstree-themeicon-hidden" role="presentation"></i>A02BC02 - pantoprazole</a><ul role="group" class="jstree-children"><li role="treeitem" aria-selected="false" aria-level="6" aria-labelledby="h515_anchor" id="h515" class="jstree-node  jstree-leaf"><i class="jstree-icon jstree-ocl" role="presentation"></i><a class="jstree-anchor" href="#" tabindex="-1" id="h515_anchor"><i class="jstree-icon jstree-themeicon jstree-themeicon-custom" role="presentation" style="background-image: url(&quot;https://ec.europa.eu/health/documents/community-register/images/logo_file.png&quot;); 
background-size: auto; background-position: center center;"></i>CONTROLOC Control (active)</a></li><li role="treeitem" aria-selected="false" aria-level="6" aria-labelledby="h518_anchor" id="h518" class="jstree-node  jstree-leaf"><i class="jstree-icon jstree-ocl" role="presentation"></i><a class="jstree-anchor" href="#" tabindex="-1" id="h518_anchor"><i class="jstree-icon jstree-themeicon jstree-themeicon-custom" role="presentation" style="background-image: url(&quot;https://ec.europa.eu/health/documents/community-register/images/logo_file.png&quot;); background-size: auto; background-position: center center;"></i>PANTECTA Control (withdrawn)</a></li><li role="treeitem" aria-selected="false" aria-level="6" aria-labelledby="h519_anchor" id="h519" class="jstree-node  jstree-leaf"><i class="jstree-icon jstree-ocl" role="presentation"></i><a class="jstree-anchor" href="#" tabindex="-1" id="h519_anchor"><i class="jstree-icon jstree-themeicon jstree-themeicon-custom" role="presentation" style="background-image: url(&quot;https://ec.europa.eu/health/documents/community-register/images/logo_file.png&quot;); background-size: auto; background-position: center center;"></i>PANTOLOC Control (active)</a></li><li role="treeitem" aria-selected="false" aria-level="6" aria-labelledby="h517_anchor" id="h517" class="jstree-node  jstree-leaf"><i class="jstree-icon jstree-ocl" role="presentation"></i><a class="jstree-anchor" href="#" tabindex="-1" id="h517_anchor"><i class="jstree-icon jstree-themeicon jstree-themeicon-custom" role="presentation" style="background-image: url(&quot;https://ec.europa.eu/health/documents/community-register/images/logo_file.png&quot;); background-size: auto; background-position: center center;"></i>PANTOZOL Control (active)</a></li><li role="treeitem" aria-selected="false" aria-level="6" aria-labelledby="ho15744_anchor" id="ho15744" class="jstree-node  jstree-leaf"><i class="jstree-icon jstree-ocl" role="presentation"></i><a 
class="jstree-anchor" href="#" tabindex="-1" id="ho15744_anchor"><i class="jstree-icon jstree-themeicon jstree-themeicon-custom" role="presentation" style="background-image: url(&quot;https://ec.europa.eu/health/documents/community-register/images/logo_file.png&quot;); background-size: auto; background-position: center center;"></i>Pantoprazole Bluefish (referral)
  

Is there a way to scrape what I want?
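
For illustration, here is a minimal sketch of one way the expanded tree might be read with BeautifulSoup. It assumes the HTML handed to the parser already contains the rendered jstree markup (as in the second excerpt), where each leaf is an `li` with class `jstree-leaf` and its enclosing `li` carries the ATC code in its `id`. The sample `html` string is an abbreviated stand-in for that markup, and the helper name `leaves_with_codes` is hypothetical.

```python
from bs4 import BeautifulSoup

# Abbreviated stand-in for the expanded tree markup from the question
# (icons and most attributes are omitted for brevity).
html = """
<div id="jstree_atc">
 <ul>
  <li id="A02BC01" class="jstree-node jstree-open">
   <a class="jstree-anchor" id="A02BC01_anchor">A02BC01 - omeprazole</a>
   <ul>
    <li id="ho15861" class="jstree-node jstree-leaf jstree-last">
     <a class="jstree-anchor" id="ho15861_anchor"><i></i>Losec and associated names (referral)</a>
    </li>
   </ul>
  </li>
  <li id="A02BC02" class="jstree-node jstree-open">
   <a class="jstree-anchor" id="A02BC02_anchor">A02BC02 - pantoprazole</a>
   <ul>
    <li id="h515" class="jstree-node jstree-leaf">
     <a class="jstree-anchor" id="h515_anchor"><i></i>CONTROLOC Control (active)</a>
    </li>
   </ul>
  </li>
 </ul>
</div>
"""


def leaves_with_codes(markup):
    """Return (parent_code, leaf_name) pairs for every jstree leaf node."""
    soup = BeautifulSoup(markup, "html.parser")
    pairs = []
    for leaf in soup.select("li.jstree-leaf"):
        # The visible name is the text of the leaf's own anchor.
        name = leaf.find("a", class_="jstree-anchor").get_text(strip=True)
        # The nearest enclosing tree node holds the ATC code in its id.
        parent = leaf.find_parent("li", class_="jstree-node")
        code = parent["id"] if parent is not None else None
        pairs.append((code, name))
    return pairs


for code, name in leaves_with_codes(html):
    print(code, name)
```

This only works if the tree nodes are present in the HTML being parsed; a plain HTTP fetch of the page would likely return the markup before the script has built the tree.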

Comments:

1. Maybe the first structure is written in JavaScript, and the second part of the HTML just indicates where the JavaScript-generated content sits on the page? When inspecting the leaves I get: <a class="jstree-anchor" href="#" tabindex="-1" id="ho15861_anchor"><i class="jstree-icon jstree-themeicon jstree-themeicon-custom" role="presentation" style="background-image: url(&quot;https://ec.europa.eu/health/documents/community-register/images/logo_file.png&quot;); background-size: auto; background-position: center center;"></i>Losec and associated names (referral)</a>

2. Do you have a URL you can provide? It would make it a bit easier to test the actual result. From the information provided, you need to look at the "text" field: if it contains "(referral)", print the whole line.
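
The idea in the last comment could be sketched roughly as follows, reading the `dataSet` array from the script block instead of the rendered tree. This is only a sketch under some assumptions: that the array is valid JSON and is closed with `];` (the excerpt in the question is cut off before its end), and that leaf entries are the ones carrying `"type":"pl"`, as in the lines shown. The `page_source` string is an abbreviated stand-in for the raw page text, and `product_rows` with its `status` parameter is a hypothetical helper.

```python
import json
import re

# Abbreviated stand-in for the raw page source. The closing "];" is an
# assumption, since the excerpt in the question is truncated.
page_source = """
<script type="text/javascript">
    var exportTitle ="Centralised medicinal products for human use by ATC code";
    var dataSet = [
{"id":"A02BC","parent":"A02B","text":"A02BC - Proton pump inhibitors"},
{"id":"A02BC01","parent":"A02BC","text":"A02BC01 - omeprazole"},
{"id":"ho15861","parent":"A02BC01","text":"Losec and associated names (referral)","type":"pl"},
{"id":"A02BC02","parent":"A02BC","text":"A02BC02 - pantoprazole"},
{"id":"h515","parent":"A02BC02","text":"CONTROLOC Control (active)","type":"pl"}
];
</script>
"""


def product_rows(source, status=None):
    """Return (parent_code, text) for leaf entries, optionally by status."""
    # Capture the array literal assigned to dataSet.
    match = re.search(r"var\s+dataSet\s*=\s*(\[.*?\])\s*;", source, re.DOTALL)
    if match is None:
        return []
    nodes = json.loads(match.group(1))
    # Leaf entries are assumed to be the ones marked with "type": "pl".
    rows = [(n["parent"], n["text"]) for n in nodes if n.get("type") == "pl"]
    if status is not None:
        # Keep only entries whose text ends with e.g. "(referral)".
        rows = [row for row in rows if row[1].endswith(f"({status})")]
    return rows


for code, text in product_rows(page_source, status="referral"):
    print(code, text)
```

Calling `product_rows(page_source)` without `status` returns every leaf with its parent code, which is closer to what the question asks for. If the real array turns out not to be strict JSON (for example, because of a trailing comma), `json.loads` would raise and the text would need cleaning first.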