How do I collect text data from a URL with or without ".html" in the link?

#python #html #url #beautifulsoup

Question:

I am trying to collect some text data from a URL, for example https://scikit-learn.org/stable/modules/linear_model.html .

I would like to get the following text data from the HTML:

  1.1. Linear Models¶
 The following are a set of methods intended for regression in which the target value is 
 expected to be a linear combination of the features. In mathematical notation, if 
 is the predicted value.

My code:

import urllib.request
from bs4 import BeautifulSoup

link = "https://scikit-learn.org/stable/modules/linear_model.html"
f = urllib.request.urlopen(link)
html = f.read()
soup = BeautifulSoup(html)
print(soup.prettify())

How do I navigate into the nested HTML body to get the text data shown above?

In addition, I need to do the same for some links without ".html". I use the same code, but no text data is returned from those links.

I don't see any of the text data when I print it with

  print(soup.prettify())

Return status

What could be the reason?

Thanks

Answer #1:

When creating a BeautifulSoup object, you should specify the parser you want to use. I also recommend using requests instead of urllib, but that is entirely up to you. Here is how to extract the text you need:

div = soup.find('div', class_="section")  # find the first div with class "section"
print(div.h1.text)  # text of the first h1 tag inside that div
print(div.p.text)  # text of the first p tag inside that div

Output:

1.1. Linear Models¶
The following are a set of methods intended for regression in which
the target value is expected to be a linear combination of the features.
In mathematical notation, if \(\hat{y}\) is the predicted
value.
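
The lookup above can fail with an AttributeError when no matching div exists, because `soup.find` returns `None` in that case. A slightly more defensive sketch follows, run on an inline sample so that it needs no network access. The helper name `first_heading_and_paragraph`, the sample markup, and the fallback to a `<section>` element are assumptions for illustration; `html.parser` is used only because it ships with Python.

```python
from bs4 import BeautifulSoup


def first_heading_and_paragraph(html):
    """Return (heading, paragraph) text from the first section, or None."""
    # Name the parser explicitly instead of letting BeautifulSoup guess.
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: the content lives in <div class="section"> or in a
    # <section> element, depending on how the page was generated.
    section = soup.select_one("div.section, section")
    if section is None or section.h1 is None or section.p is None:
        return None
    return (
        section.h1.get_text(strip=True),
        section.p.get_text(" ", strip=True),
    )


# Hypothetical sample markup that imitates the structure of the page.
sample = """
<div class="section" id="linear-models">
  <h1>1.1. Linear Models</h1>
  <p>The following are a set of methods intended for regression.</p>
</div>
"""
print(first_heading_and_paragraph(sample))
```

Returning `None` rather than raising lets the caller decide what to do with pages that do not have the expected layout.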

Here is the full code:

import urllib.request
from bs4 import BeautifulSoup

link = "https://scikit-learn.org/stable/modules/linear_model.html"
f = urllib.request.urlopen(link)
html = f.read()
soup = BeautifulSoup(html, 'html5lib')  # explicit parser; requires the html5lib package

div = soup.find('div', class_="section")  # first div with class "section"
print(div.h1.text)  # heading text
print(div.p.text)  # first paragraph text
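
As for the links without ".html": whether a URL ends in ".html" does not by itself determine what the server sends back. The status code and the Content-Type header of the response are more telling. Without the actual link the cause can only be guessed. Common candidates are a page whose text is filled in by JavaScript after loading, or a server that rejects the default client. A rough diagnostic sketch, using only the standard library, might look like this; `is_html` and `inspect_url` are hypothetical helper names, and the User-Agent value is an assumption.

```python
import urllib.request


def is_html(content_type):
    """True if a Content-Type header value denotes an HTML document."""
    # The header may carry parameters, e.g. "text/html; charset=utf-8".
    media_type = content_type.split(";")[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


def inspect_url(url):
    """Fetch a URL and report status, content type, and body size."""
    # Assumption: some servers reject urllib's default User-Agent,
    # so a browser-like value is sent instead.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=10) as response:
        status = response.status
        content_type = response.headers.get("Content-Type", "")
        body = response.read()
    return {
        "status": status,
        "content_type": content_type,
        "is_html": is_html(content_type),
        "size": len(body),
    }
```

Calling `inspect_url(link)` on a problematic link shows whether HTML came back at all. If the status is 200 and the body is HTML but the expected text is still missing, the text is probably added by JavaScript, which neither urllib nor requests executes.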