How do I get the first 3 sentences of a web page in Python?

#python #html #list #beautifulsoup

Question:

I have an assignment in which one of the things I can do is find the first 3 sentences of a web page and display them. Getting the text of the page is easy enough, but I'm having trouble figuring out how to find the first 3 sentences.

import requests
from bs4 import BeautifulSoup

url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
text = soup.find_all(text=True)

output = ''
blacklist = [
      '[document]',
      'noscript',
      'header',
      'html',
      'meta',
      'head',
      'input',
      'script'
]

for t in text:
  if (t.parent.name not in blacklist):
    output += '{} '.format(t)

tempout = output.split('.')
for i in range(tempout):
  if (i >= 3):
    tempout.remove(i)

output = '.'.join(tempout)

print(output)

Answer 1:

Finding sentences in text is hard. Usually you look for characters that can end a sentence, such as '.' and '!'. But a period ('.') can also appear in the middle of a sentence, for example in an abbreviation or in a person's initials. I use a regular expression that looks for a period or exclamation mark followed by either a single space or the end of the string. That works for the first three sentences of this page, but not for arbitrary sentences.

import requests
from bs4 import BeautifulSoup
import re

url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')

paragraphs = soup.select('section.article_text p')
sentences = []
for paragraph in paragraphs:
    matches = re.findall(r'(.+?[.!])(?: |$)', paragraph.text)
    needed = 3 - len(sentences)
    found = len(matches)
    n = min(found, needed)
    for i in range(n):
        sentences.append(matches[i])
    if len(sentences) == 3:
        break
print(sentences)

This prints:

['Many people will land on this page after learning that their email address has appeared in a data breach I\'ve called "Collection #1".', "Most of them won't have a tech background or be familiar with the concept of credential stuffing so I'm going to write this post for the masses and link out to more detailed material for those who want to go deeper.", "Let's start with the raw numbers because that's the headline, then I'll drill down into where it's from and what it's composed of."]
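
To try the pattern without fetching the page, here is a minimal offline sketch. The helper name split_sentences and both sample strings are made up for illustration. The second sample shows the abbreviation caveat mentioned above: "Dr." is treated as a sentence of its own.

```python
import re

# Same pattern as in the answer: the shortest run of characters ending in
# '.' or '!' that is followed by a single space or the end of the string.
SENTENCE_RE = re.compile(r'(.+?[.!])(?: |$)')

def split_sentences(text):
    """Return the sentence-like chunks found in text (hypothetical helper)."""
    return [chunk.strip() for chunk in SENTENCE_RE.findall(text)]

sample = "First one. Second one! Third one."
print(split_sentences(sample))
# ['First one.', 'Second one!', 'Third one.']

tricky = "Dr. Smith arrived. He sat down."
print(split_sentences(tricky))
# ['Dr.', 'Smith arrived.', 'He sat down.']
```

For text with many abbreviations, a dedicated sentence tokenizer is likely to do better than a single regular expression.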

Answer 2:

To scrape the first three sentences, just add these lines to your code:

section = soup.find('section',class_ = "article_text post") #Finds the section tag with class "article_text post"

txt = section.p.text #Gets the text within the first p tag within the variable section (the section tag)

print(txt)

Output:

 Many people will land on this page after learning that their email address has appeared in a data breach I've called "Collection #1". Most of them won't have a tech background or be familiar with the concept of credential stuffing so I'm going to write this post for the masses and link out to more detailed material for those who want to go deeper.

Hope this helps!
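
Note that the output above contains only two sentences, because the first paragraph of that page happens to be that short. Here is a rough sketch of one way to keep going across paragraphs until three sentences have been collected. It works on plain strings, so it assumes the paragraph texts were already extracted (for example with [p.text for p in section.find_all('p')]). The function name first_sentences and the sample paragraphs are hypothetical, and the naive split has the same abbreviation problem described in Answer 1.

```python
import re

def first_sentences(paragraph_texts, limit=3):
    """Collect up to `limit` sentences from a list of paragraph strings."""
    sentences = []
    for text in paragraph_texts:
        # Split after '.', '!' or '?' when followed by whitespace.
        for part in re.split(r'(?<=[.!?])\s+', text.strip()):
            if part:
                sentences.append(part)
            if len(sentences) == limit:
                return sentences
    return sentences

# Made-up paragraphs standing in for the text of the page's <p> tags.
paragraphs = [
    "One short sentence. And a second one.",
    "The third sentence lives here. A fourth is ignored.",
]
print(first_sentences(paragraphs))
# ['One short sentence.', 'And a second one.', 'The third sentence lives here.']
```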

Answer 3:

Actually, with Beautiful Soup you can filter by the class "article_text post", which you can see in the page source:

myData=soup.find('section',class_ = "article_text post")
print(myData.p.text)

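This gets the inner text of the first p element.

Use this in place of the soup.find_all(text=True) loop. The line soup = BeautifulSoup(html_page, 'html.parser') is still needed, because the snippet calls soup.find.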
