Scraping an h4 tag

#python #python-3.x #web-scraping

Question:

How can I get both the link and the text from an h4 tag in Python?

[Image: the h4 tag]

I have the following script, which goes through different pages and loads the "class" data:

import numpy as np
from random import randint
from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver

pages = np.arange(1, 2, 1)
data = []

for page in pages:

    page = "https://www.bartonassociates.com/blog/tag/Infographics/p" + str(page)
    driver = webdriver.Chrome(r"C:\Users\ssakorkar\Desktop\chromedriver")
    driver.get(page)
    sleep(randint(2, 10))
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    my_table = soup.find_all(class_=['author'])

    for tag in my_table:
        data.append(tag.get_text())
 

Answer #1:

To get the text and the link from the <h4> tags, you can use the following example:

import requests
from bs4 import BeautifulSoup


for page in range(1, 2):
    # Build the URL of the listing page from its number
    page = "https://www.bartonassociates.com/blog/tag/Infographics/p" + str(
        page
    )
    soup = BeautifulSoup(requests.get(page).content, "html.parser")
    # Each <h4> wraps an <a>: print the link, then the heading text
    for h4 in soup.select("h4"):
        print(h4.a["href"])
        print(h4.get_text(strip=True))
        print()
 

Prints:

https://www.bartonassociates.com/blog/updated-can-an-np-do-that-infographic
Updated: Can an NP Do That? [INFOGRAPHIC]

https://www.bartonassociates.com/blog/locum-tenens-for-dentists-infographic
Locum Tenens for Dentists [INFOGRAPHIC]

https://www.bartonassociates.com/blog/can-a-crna-do-that-infographic
Can a CRNA Do That? [INFOGRAPHIC]

https://www.bartonassociates.com/blog/surviving-the-physician-shortage-infographic
Surviving the Physician Shortage [INFOGRAPHIC]

https://www.bartonassociates.com/blog/get-the-facts-busting-locum-tenens-myths-infographic
Get the Facts: Busting Locum Tenens Myths [INFOGRAPHIC]

https://www.bartonassociates.com/blog/the-truth-about-medical-billing-infographic
The Truth About Medical Billing [INFOGRAPHIC]
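
Note that h4.a["href"] raises a TypeError if some <h4> has no <a> inside it. Below is a minimal sketch of a more defensive variant. It parses a made-up inline HTML snippet (not the real page markup) so that it runs offline; the helper name extract_h4_links and its base_url parameter are illustrative assumptions, and urljoin is used only in case some links are relative.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up markup for illustration only; the real page may look different.
SAMPLE_HTML = """
<div>
  <h4><a href="/blog/first-post"> First post </a></h4>
  <h4>Heading without a link</h4>
  <h4><a href="https://example.com/blog/second-post">Second post</a></h4>
</div>
"""


def extract_h4_links(html, base_url):
    """Return (title, absolute_link) pairs for every <h4> that contains an <a href>."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for h4 in soup.select("h4"):
        link = h4.find("a", href=True)
        if link is None:
            # Skip headings that have no link instead of raising TypeError
            continue
        # urljoin leaves absolute URLs untouched and resolves relative ones
        results.append((h4.get_text(strip=True), urljoin(base_url, link["href"])))
    return results


pairs = extract_h4_links(SAMPLE_HTML, "https://example.com")
for title, href in pairs:
    print(title, href)
```

With real pages you would pass requests.get(url).content (or driver.page_source) instead of SAMPLE_HTML.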

 

EDIT: To add the titles/links to a list, you can use:

import requests
import pandas as pd
from bs4 import BeautifulSoup


data = []
for page in range(1, 2):
    # Build the URL of the listing page from its number
    page = "https://www.bartonassociates.com/blog/tag/Infographics/p" + str(
        page
    )
    soup = BeautifulSoup(requests.get(page).content, "html.parser")
    # Store one (title, link) tuple per <h4>
    for h4 in soup.select("h4"):
        data.append((h4.get_text(strip=True), h4.a["href"]))


print(data)
 

Prints:

[
    (
        "Updated: Can an NP Do That? [INFOGRAPHIC]",
        "https://www.bartonassociates.com/blog/updated-can-an-np-do-that-infographic",
    ),
    (
        "Locum Tenens for Dentists [INFOGRAPHIC]",
        "https://www.bartonassociates.com/blog/locum-tenens-for-dentists-infographic",
    ),
    (
        "Can a CRNA Do That? [INFOGRAPHIC]",
        "https://www.bartonassociates.com/blog/can-a-crna-do-that-infographic",
    ),
    (
        "Surviving the Physician Shortage [INFOGRAPHIC]",
        "https://www.bartonassociates.com/blog/surviving-the-physician-shortage-infographic",
    ),
    (
        "Get the Facts: Busting Locum Tenens Myths [INFOGRAPHIC]",
        "https://www.bartonassociates.com/blog/get-the-facts-busting-locum-tenens-myths-infographic",
    ),
    (
        "The Truth About Medical Billing [INFOGRAPHIC]",
        "https://www.bartonassociates.com/blog/the-truth-about-medical-billing-infographic",
    ),
]
 

Or create a DataFrame from data:

df = pd.DataFrame(data, columns=["title", "link"])
print(df)
 

Prints:

                                                      title                                                                                        link
0                Updated: Can an NP Do That? [INFOGRAPHIC]                 https://www.bartonassociates.com/blog/updated-can-an-np-do-that-infographic
1                  Locum Tenens for Dentists [INFOGRAPHIC]                 https://www.bartonassociates.com/blog/locum-tenens-for-dentists-infographic
2                        Can a CRNA Do That? [INFOGRAPHIC]                        https://www.bartonassociates.com/blog/can-a-crna-do-that-infographic
3           Surviving the Physician Shortage [INFOGRAPHIC]          https://www.bartonassociates.com/blog/surviving-the-physician-shortage-infographic
4  Get the Facts: Busting Locum Tenens Myths [INFOGRAPHIC]  https://www.bartonassociates.com/blog/get-the-facts-busting-locum-tenens-myths-infographic
5            The Truth About Medical Billing [INFOGRAPHIC]           https://www.bartonassociates.com/blog/the-truth-about-medical-billing-infographic
 

Comments:

1. Sorry, I did not mention this earlier. I want this to be appended to an empty list, as I do with the "data" list. Basically, I am trying to build a data frame of authors, links, and blog titles.
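
For the case described in this comment (authors, links, and titles in one list), a rough sketch could look like the following. It assumes that each post sits in its own container holding both the <h4> link and an element with class "author". The div.post container, the sample markup, and the helper name collect_rows are hypothetical and would need to be adapted to the real page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one container per post with a title link and an author.
SAMPLE_HTML = """
<div class="post">
  <h4><a href="https://example.com/blog/post-one">Post one</a></h4>
  <span class="author">Alice</span>
</div>
<div class="post">
  <h4><a href="https://example.com/blog/post-two">Post two</a></h4>
  <span class="author">Bob</span>
</div>
"""


def collect_rows(html):
    """Return one dict per post with its title, link, and author (None if missing)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for post in soup.select("div.post"):  # assumed container selector
        link = post.select_one("h4 a[href]")
        if link is None:
            continue  # no title link in this container, skip it
        author = post.select_one(".author")
        rows.append(
            {
                "title": link.get_text(strip=True),
                "link": link["href"],
                "author": author.get_text(strip=True) if author else None,
            }
        )
    return rows


data = []  # the empty list that gets filled page by page
data.extend(collect_rows(SAMPLE_HTML))
print(data)
```

A list of dicts like this can be passed directly to pd.DataFrame(data), which produces the title, link, and author columns.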