Selenium web scraping: how to get specific tags with Selenium?

#python #selenium #web-scraping #beautifulsoup


Question:

I am scraping various courses from university websites.

HTML of part of the site:

 <div>
<h2>About the programme</h2>
<p>The National&nbsp;Joint&nbsp;PhD Programme in Nautical Operations&nbsp;is organised as a joint degree between the following four national higher education institutions offering professional maritime education:</p>
<ul>
    <li>Universtity of Troms&oslash; - The Arctic University of Norway (UiT)</li>
    <li>University of&nbsp;South-Eastern&nbsp;Norway (USN)</li>
    <li>Western Norway University of Applied Sciences (HVL)</li>
    <li>Norwegian University of Science and Technology (NTNU)</li>
</ul>
<p>
    The National&nbsp;Joint&nbsp;PhD Programme in Nautical Operations will educate qualified candidates for research, teaching, dissemination and innovation work, and other activities requiring scientific insight and an operational
    maritime focus.&nbsp;
</p>
<p>
    Implementation of complex nautical operations today requires interdisciplinarity and differentiated competence, including research expertise, for the safe and efficient planning, implementation and evaluation of nautical
    operations.&nbsp;
</p>
<p>The programme has the following&nbsp;vision: to create an internationally recognized national PhD degree in nautical operations.</p>
<p>This vision will be achieved through the following overall objectives:</p>
<ol>
    <li>Strengthen the multidisciplinary national expertise in nautical operations through collaboration between the four higher education institutions in Norway with professional maritime education.</li>
    <li>The PhD Programme in Nautical Operations is the preferred Programme in the field and attracts good applicants nationally and internationally from major maritime nations.</li>
    <li>Individuals graduating from the Programme are in demand both nationally and internationally because they have a strong and relevant research-based expertise and the ability to innovate and adapt.</li>
    <li>Increase value creation and innovation through close cooperation between academia, maritime industry and public sector.</li>
    <li>The multidisciplinary national competence related to nautical operations constitutes an internationally recognised professional environment that sets the terms for the development of knowledge in the field.</li>
</ol>
<h2>Academic content</h2>
<p>Nautical operations consist of two subject areas:</p>
<ul>
    <li>
        Nautical studies&nbsp;that include navigation, maneuvering and transport of floating craft, and operations, indicating that the PhD program will focus on applied research to support, improve and develop the activities
        undertaken.
    </li>
    <li>
        The operational perspective&nbsp;includes strategic, tactical and operational aspects.&nbsp;Strategic levels include the choice of type and size of a ship fleet.&nbsp;Tactical aspects concern the design of individual ships and
        the selection of equipment and staff.&nbsp;The operational aspects include planning, implementation and evaluation of nautical operations.
    </li>
</ul>
<p>There is a compulsory&nbsp;joint maritime course offered at all the four institutions.</p>
  

Link to the site:
https://www.usn.no/english/research/postgraduate-studies-phd/our-phd-programmes/nautical-operations/

I am trying to get the text for course_description / about_the_course and academic_content, that is, the text under each of the 'h2' tags above. I have no idea how to write generalized code that scrapes the text belonging to each h2 tag.

Also, I don't think indexing will help, since the order of the <p> and <li> tags changes from course to course.

Answer #1:

You can use .get_text() with separator='\n':

 import requests
from bs4 import BeautifulSoup


url = 'https://www.usn.no/english/research/postgraduate-studies-phd/our-phd-programmes/nautical-operations/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

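# Find the heading; the "t and ..." guard skips h2 tags that have no single string
desc = soup.find('h2', string=lambda t: t and 'About the programme' in t)
# Print all text inside the heading's parent, one text node per line
print( desc.parent.get_text(strip=True, separator='\n') )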
  

Prints:

 About the programme
The National Joint PhD Programme in Nautical Operations is organised as a joint degree between the following four national higher education institutions offering professional maritime education:
Universtity of Tromsø
- The Arctic University of Norway (UiT)
University of South-Eastern Norway (USN)
Western Norway University of Applied Sciences
(HVL)
Norwegian University of Science and Technology
(NTNU)
The National Joint PhD Programme in Nautical Operations will educate qualified candidates for research, teaching, dissemination and innovation work, and other activities requiring scientific insight and an operational maritime focus.
Implementation of complex nautical operations today requires interdisciplinarity and differentiated competence, including research expertise, for the safe and efficient planning, implementation and evaluation of nautical operations.
The programme has the following vision: to create an internationally recognized national PhD degree in nautical operations.
This vision will be achieved through the following overall objectives:
Strengthen the multidisciplinary national expertise in nautical operations through collaboration between the four higher education institutions in Norway with professional maritime education.
The PhD Programme in Nautical Operations is the preferred Programme in the field and attracts good applicants nationally and internationally from major maritime nations.
Individuals graduating from the Programme are in demand both nationally and internationally because they have a strong and relevant research-based expertise and the ability to innovate and adapt.
Increase value creation and innovation through close cooperation between academia, maritime industry and public sector.
The multidisciplinary national competence related to nautical operations constitutes an internationally recognised professional environment that sets the terms for the development of knowledge in the field.
Academic content
Nautical operations consist of two subject areas:
Nautical studies that include navigation, maneuvering and transport of floating craft, and operations, indicating that the PhD program will focus on applied research to support, improve and develop the activities undertaken.
The operational perspective includes strategic, tactical and operational aspects. Strategic levels include the choice of type and size of a ship fleet. Tactical aspects concern the design of individual ships and the selection of equipment and staff. The operational aspects include planning, implementation and evaluation of nautical operations.
There is a compulsory joint maritime course offered at all the four institutions.
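
Some list items in the output above are split across two lines. One plausible cause, assumed here rather than verified against the live page, is nested inline tags inside the <li> elements: the separator is inserted between every text node, not between block elements. A minimal sketch on a hypothetical snippet shows the effect and one way around it:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the first list item contains a nested inline tag.
HTML = "<ul><li>University of <b>Tromsø</b> - UiT</li><li>NTNU</li></ul>"

soup = BeautifulSoup(HTML, "html.parser")

# A separator applied to the whole container lands between every text node,
# so the nested <b> splits the first item across several lines.
whole = soup.ul.get_text(separator="\n", strip=True)

# Calling get_text() once per <li> keeps each item on a single line.
per_item = [li.get_text(" ", strip=True) for li in soup.find_all("li")]

print(whole)
print(per_item)
```

Extracting text element by element costs a few more lines but gives control over where the line breaks go.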
  

Answer #2:

It's actually quite simple. Just locate the div tag and print the text inside it. Here is the full code:

 from bs4 import BeautifulSoup
import requests

r = requests.get('https://www.usn.no/english/research/postgraduate-studies-phd/our-phd-programmes/nautical-operations/').text

soup = BeautifulSoup(r,'html5lib')

div_tag = soup.find('div',class_ = "articleelement newtext contentAbove")

print(div_tag.text)
  

Output:

 About the programme
The National Joint PhD Programme in Nautical Operations is organised as a joint degree between the following four national higher education institutions offering professional maritime education:
    Universtity of Tromsø - The Arctic University of Norway (UiT)
    University of South-Eastern Norway (USN)
    Western Norway University of Applied Sciences (HVL)
    Norwegian University of Science and Technology (NTNU)
The National Joint PhD Programme in Nautical Operations will educate qualified candidates for research, teaching, dissemination and innovation work, and other activities requiring scientific insight and an operational maritime focus. 
Implementation of complex nautical operations today requires interdisciplinarity and differentiated competence, including research expertise, for the safe and efficient planning, implementation and evaluation of nautical operations. 
The programme has the following vision: to create an internationally recognized national PhD degree in nautical operations.
This vision will be achieved through the following overall objectives:
    Strengthen the multidisciplinary national expertise in nautical operations through collaboration between the four higher education institutions in Norway with professional maritime education.
    The PhD Programme in Nautical Operations is the preferred Programme in the field and attracts good applicants nationally and internationally from major maritime nations.
    Individuals graduating from the Programme are in demand both nationally and internationally because they have a strong and relevant research-based expertise and the ability to innovate and adapt.
    Increase value creation and innovation through close cooperation between academia, maritime industry and public sector.
    The multidisciplinary national competence related to nautical operations constitutes an internationally recognised professional environment that sets the terms for the development of knowledge in the field.
Academic content
Nautical operations consist of two subject areas:
    Nautical studies that include navigation, maneuvering and transport of floating craft, and operations, indicating that the PhD program will focus on applied research to support, improve and develop the activities undertaken.
    The operational perspective includes strategic, tactical and operational aspects. Strategic levels include the choice of type and size of a ship fleet. Tactical aspects concern the design of individual ships and the selection of equipment and staff. The operational aspects include planning, implementation and evaluation of nautical operations.
There is a compulsory joint maritime course offered at all the four institutions.
  

That gets the text. If you only want the headings, here is the full code:

 from bs4 import BeautifulSoup
import requests

r = requests.get('https://www.usn.no/english/research/postgraduate-studies-phd/our-phd-programmes/nautical-operations/').text

soup = BeautifulSoup(r,'html5lib')

div_tag = soup.find('div',class_ = "articleelement newtext contentAbove")

headings = div_tag.find_all('h2')

for heading in headings:
    print(heading.text)
  

Output:

 About the programme
Academic content
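
One caveat about the class lookup used in this answer: a multi-class string passed to class_ is matched against the exact attribute value, so it may miss a page that lists the same classes in a different order. A minimal sketch on hypothetical markup, using a CSS selector as an order-independent alternative:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same three classes as in the answer, in a different order.
HTML = '<div class="newtext articleelement contentAbove"><h2>About the programme</h2></div>'

soup = BeautifulSoup(HTML, "html.parser")

# The multi-class string is compared with the exact attribute value: no match here.
exact = soup.find("div", class_="articleelement newtext contentAbove")

# A CSS selector matches the three classes regardless of their order.
by_css = soup.select_one("div.articleelement.newtext.contentAbove")

print(exact)
print(by_css.h2.text)
```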
  

Hope this helps!

Comments:

1. Your code would be perfect if I needed all of the text. But here I am trying to get "About the programme" and "Academic content" separately.

2. Are you trying to get the headings?

3. Check out my latest edit. I have updated it to show how to get both the text and the headings.
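
For the case raised in the first comment, getting each section separately, one possible approach is to walk the siblings that follow each h2 and stop at the next h2. The sketch below runs on hypothetical sample markup rather than the live page, and the helper name sections_by_heading is an illustration, not part of any answer in this thread:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
HTML = """
<div>
<h2>About the programme</h2>
<p>Joint degree between four institutions:</p>
<ul>
    <li>UiT</li>
    <li>USN</li>
</ul>
<h2>Academic content</h2>
<p>Two subject areas.</p>
</div>
"""


def sections_by_heading(container, heading_tag="h2"):
    """Map each heading's text to the text of the siblings that follow it."""
    sections = {}
    for heading in container.find_all(heading_tag):
        parts = []
        # find_next_siblings() yields the following sibling tags in document order.
        for sibling in heading.find_next_siblings():
            if sibling.name == heading_tag:
                break  # the next section starts here
            text = sibling.get_text(" ", strip=True)
            if text:
                parts.append(text)
        sections[heading.get_text(strip=True)] = "\n".join(parts)
    return sections


soup = BeautifulSoup(HTML, "html.parser")
sections = sections_by_heading(soup.div)
print(sections["About the programme"])
print(sections["Academic content"])
```

Because the loop does not care which tags follow a heading, it should tolerate a different order of p, ul and ol elements from course to course, provided the headings and their content are siblings inside the same container.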

Answer #3:

You can try this with Selenium:

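 from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

PATH = "./chromedriver"

# Selenium 4 style: the driver path is passed through a Service object
driver = webdriver.Chrome(service=Service(PATH))
driver.implicitly_wait(5)

url = "https://www.usn.no/english/research/postgraduate-studies-phd/our-phd-programmes/nautical-operations/"
driver.get(url)

# find_element returns only the first <p> sibling that follows the heading
path = "//div[@class='articleelement newtext contentAbove']//h2[contains(text(), 'About the programme')]/following-sibling::p"
about_the_program = driver.find_element(By.XPATH, path)

path = "//div[@class='articleelement newtext contentAbove']//h2[contains(text(), 'Academic content')]/following-sibling::p"
academic_content = driver.find_element(By.XPATH, path)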
  

Here you locate the h2 tag with the text About the programme and/or Academic content. Then you select the following sibling of the h2 tag that is a p tag. If you need a sibling that is some other tag, you can specify it in the path.

EDIT 1

If you don't know which tag will come after the h2 tag, you can probably try this:

 from selenium.webdriver.common.by import By

list_of_tags = ['p', 'ul', 'span']

for tag in list_of_tags:
    path = "//div[@class='articleelement newtext contentAbove']//h2[contains(text(), 'About the programme')]/following-sibling::"
    try:
        # Append the tag name to the sibling axis, e.g. ...following-sibling::ul
        path = path + tag
        element_required = driver.find_element(By.XPATH, path)
    except Exception as e:
        print(e)
  

This code updates the path variable with each tag in the list. If the tag exists inside the div, the code retrieves it; otherwise it prints the error.
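
An alternative to looping with try-except is to put the "or" into the XPath itself with a self:: test, and to restrict the match to siblings whose nearest preceding h2 is the wanted heading. The sketch below is an assumption-laden illustration: section_xpath and section_text are hypothetical helper names, the heading text is assumed to contain no single quotes, and driver is expected to be a Selenium 4 WebDriver created elsewhere.

```python
def section_xpath(heading, tags=("p", "ul", "ol")):
    """Build an XPath for the siblings between one h2 and the next."""
    # "self::p or self::ul ..." acts as an "or" over the allowed tag names.
    tag_test = " or ".join(f"self::{tag}" for tag in tags)
    return (
        f"//h2[contains(., '{heading}')]"
        f"/following-sibling::*[{tag_test}]"
        # Keep only siblings whose nearest preceding h2 is this heading.
        f"[preceding-sibling::h2[1][contains(., '{heading}')]]"
    )


def section_text(driver, heading):
    """Join the text of every matching sibling element."""
    # "xpath" is the string value of selenium's By.XPATH.
    elements = driver.find_elements("xpath", section_xpath(heading))
    return "\n".join(element.text for element in elements)


print(section_xpath("About the programme"))
```

With the driver from the answer above, a call such as section_text(driver, "Academic content") would return the text of that section only, since find_elements collects every matching sibling instead of just the first.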

Comments:

1. I didn't know about "following-sibling". Thank you very much. But how can I use 'p' or 'ul' / 'li' with following-sibling?

2. Check path in the code. There you have following-sibling::p. Change p to whichever sibling you want to extract data from. Keep in mind that the sibling must be inside the same element as the h2.

3. Yes, I understood that. What I meant was: can I use some kind of 'or' with following-sibling::?

4. You can probably use try-except.

5. I'm not sure it will be that simple. But I have edited the code with a solution to this problem. Not the best solution, but it should work.