BeautifulSoup help: extracting multiple strings from multiple tags with find_all

#python #beautifulsoup

Question:

I need help with a small bs4 function. My web source is structured as follows:

 article class=" "
div class=" "
figure id=" "
<p><strong>1.</strong> A string of text </p>
<p><strong>2.</strong> A string of text </p>
<p><strong>3.</strong> A string of text </p>
<p><strong>4.</strong> A string of text </p>
etc..

I'm trying to extract the text of each of these <p> lines while ignoring the other <p> tags elsewhere on the page.
At the moment I can extract individual lines with find_all()[1], but I'd like to extract several lines at once.
My code:

from bs4 import BeautifulSoup
import requests

def getFact():
    page = requests.get("https://thoughtcatalog.com/jacob-geers/2016/04/really-funny-random-weird-facts/")    # call webpage
    soup = BeautifulSoup(page.content, 'html.parser')
    # find_all() returns a list of tags; [25] picks a single <p> from it
    fact = soup.find_all('p')[25].text
    print('Fact Selected')

    # Write the fact to a file, then read it back
    with open('out.txt', 'w') as f:
        f.write(fact)

    with open('out.txt', 'r') as file:
        fact_ = file.read().rstrip('\n')    # strip trailing newlines

    print(fact_)

getFact()

I can only give find_all a single integer index, but is it possible to select several and store them in a list?
I've searched the bs4 docs, Google, etc., and I'm familiar with the general usage, but I can't seem to find anything that covers find_all [int options].

1. find_all returns a list, so you should iterate over it. For an example, see the part of the bs4 documentation that begins "One common task is extracting all the URLs found within a page's <a> tags".
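As a minimal sketch of what the comment describes: the result of find_all() behaves like a regular Python list, so it can be sliced or indexed at several positions instead of just one. The HTML snippet and the variable names below are invented for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page (assumption, for illustration only)
html = """
<p>intro</p>
<p><strong>1.</strong> First fact</p>
<p><strong>2.</strong> Second fact</p>
<p><strong>3.</strong> Third fact</p>
<p>footer</p>
"""

soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.find_all("p")  # list-like collection of <p> tags

# Slice a contiguous range of paragraphs (indexes 1, 2 and 3)
facts = [p.get_text().strip() for p in paragraphs[1:4]]

# Or pick arbitrary positions from a list of indexes
wanted = [1, 3]
picked = [paragraphs[i].get_text().strip() for i in wanted]

print(facts)
print(picked)
```

Both results are plain lists of strings, so they can be stored, joined, or written to a file as needed.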

Answer #1:

You can do this by first selecting the <article> whose class name is tc_article tc_article-width, and then finding all of its <p> tags with the .find_all() method.

Here is the code:

from bs4 import BeautifulSoup
import requests

def getFact():
    page = requests.get("https://thoughtcatalog.com/jacob-geers/2016/04/really-funny-random-weird-facts/")    # call webpage
    soup = BeautifulSoup(page.content, 'html.parser')
    # Narrow the search to the article first, then collect only its <p> tags
    ps = soup.find('article', class_='tc_article tc_article-width').find_all('p')
    print('Facts:')
    for i in ps:
        print(i.text.strip())

getFact()

 Facts:
1. Most toilets flush in E flat.
2. A raisin dropped in a glass of fresh champagne will bounce up and down continuously from the bottom of the glass to the top.
3. Cap’n Crunch’s full name is Horatio Magellan Crunch.
4. The Vatican City is the country that drinks the most wine per capita at 74 liters per citizen per year.
5. Approximately 40,000 Americans are injured by toilets each year.
.
.
.
25. The US Treasury once considered producing doughnut-shaped coins!
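Since the question asks for the results to be stored in a list, here is a hedged variation on the answer's approach. It runs against a small made-up HTML string instead of the live page, and the function name extract_facts, the div class, and the extra paragraphs are assumptions for illustration. It uses a CSS selector, which matches the article by a single class (the class_ string in the answer has to match the whole attribute value exactly), and keeps only the paragraphs that contain a <strong> number marker.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the structure described in the question
html = """
<article class="tc_article tc_article-width">
  <div class="body">
    <p>Intro paragraph without a number.</p>
    <p><strong>1.</strong> Most toilets flush in E flat.</p>
    <p><strong>2.</strong> A second fact.</p>
  </div>
</article>
<p><strong>Ad:</strong> unrelated paragraph outside the article.</p>
"""

def extract_facts(markup):
    """Return the text of every numbered <p> inside the article as a list."""
    soup = BeautifulSoup(markup, "html.parser")
    facts = []
    # Every <p> nested inside an <article> that carries the class tc_article
    for p in soup.select("article.tc_article p"):
        # Keep only paragraphs that contain a <strong> marker
        if p.find("strong") is not None:
            facts.append(p.get_text().strip())
    return facts

facts = extract_facts(html)
print(facts)
```

On the real page the same function could be called with page.content from requests, assuming the article still carries the tc_article class.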