#python #python-3.x #web-scraping
Question:
How can I get both the link and the text from an H4 tag in Python?
[Image: the h4 tag]
I have the following script, which goes through different pages and downloads the "class" data:
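import numpy as np
from random import randint
from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver

pages = np.arange(1, 2, 1)
data = []
for page in pages:
    # Build the URL of the listing page (the "+" was missing in the original)
    url = "https://www.bartonassociates.com/blog/tag/Infographics/p" + str(page)
    driver = webdriver.Chrome(r"C:\Users\ssakorkar\Desktop\chromedriver")
    driver.get(url)
    sleep(randint(2, 10))
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    my_table = soup.find_all(class_=['author'])
    for tag in my_table:
        data.append(tag.get_text())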
Answer #1:
To get the text and the link from the <h4> tags, you can use the following example:
import requests
from bs4 import BeautifulSoup

for page in range(1, 2):
    url = "https://www.bartonassociates.com/blog/tag/Infographics/p" + str(page)
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for h4 in soup.select("h4"):
        print(h4.a["href"])  # link inside the heading
        print(h4.get_text(strip=True))  # heading text
        print()
Prints:
https://www.bartonassociates.com/blog/updated-can-an-np-do-that-infographic
Updated: Can an NP Do That? [INFOGRAPHIC]
https://www.bartonassociates.com/blog/locum-tenens-for-dentists-infographic
Locum Tenens for Dentists [INFOGRAPHIC]
https://www.bartonassociates.com/blog/can-a-crna-do-that-infographic
Can a CRNA Do That? [INFOGRAPHIC]
https://www.bartonassociates.com/blog/surviving-the-physician-shortage-infographic
Surviving the Physician Shortage [INFOGRAPHIC]
https://www.bartonassociates.com/blog/get-the-facts-busting-locum-tenens-myths-infographic
Get the Facts: Busting Locum Tenens Myths [INFOGRAPHIC]
https://www.bartonassociates.com/blog/the-truth-about-medical-billing-infographic
The Truth About Medical Billing [INFOGRAPHIC]
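Note that `h4.a["href"]` raises a `TypeError` if an `<h4>` contains no `<a>`, and an `href` may be relative. A minimal sketch of a more defensive variant follows. The helper name `extract_h4_links`, the `BASE_URL` constant, and the sample markup are assumptions made for illustration and do not come from the actual page; the sample HTML stands in for `requests.get(url).content`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed base URL, used only to resolve relative links.
BASE_URL = "https://www.bartonassociates.com/blog/tag/Infographics/"

# Hypothetical sample markup, used here instead of a live request.
SAMPLE_HTML = """
<h4><a href="/blog/first-post">First post</a></h4>
<h4>Heading without a link</h4>
<h4><a href="https://www.bartonassociates.com/blog/second-post"> Second post </a></h4>
"""


def extract_h4_links(html, base_url):
    """Return (text, absolute_url) pairs for every <h4> that contains a link."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for h4 in soup.select("h4"):
        link = h4.find("a", href=True)
        if link is None:
            # Skip headings that have no <a href=...> inside them
            continue
        results.append((h4.get_text(strip=True), urljoin(base_url, link["href"])))
    return results


if __name__ == "__main__":
    for text, url in extract_h4_links(SAMPLE_HTML, BASE_URL):
        print(url)
        print(text)
        print()
```

Whether the real page needs either safeguard depends on its markup; the sketch only shows where the checks would go.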
EDIT: To add the titles/links to a list, you can use:
import requests
import pandas as pd
from bs4 import BeautifulSoup

data = []
for page in range(1, 2):
    url = "https://www.bartonassociates.com/blog/tag/Infographics/p" + str(page)
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for h4 in soup.select("h4"):
        # Store each heading as a (title, link) tuple
        data.append((h4.get_text(strip=True), h4.a["href"]))

print(data)
Prints:
[
    (
        "Updated: Can an NP Do That? [INFOGRAPHIC]",
        "https://www.bartonassociates.com/blog/updated-can-an-np-do-that-infographic",
    ),
    (
        "Locum Tenens for Dentists [INFOGRAPHIC]",
        "https://www.bartonassociates.com/blog/locum-tenens-for-dentists-infographic",
    ),
    (
        "Can a CRNA Do That? [INFOGRAPHIC]",
        "https://www.bartonassociates.com/blog/can-a-crna-do-that-infographic",
    ),
    (
        "Surviving the Physician Shortage [INFOGRAPHIC]",
        "https://www.bartonassociates.com/blog/surviving-the-physician-shortage-infographic",
    ),
    (
        "Get the Facts: Busting Locum Tenens Myths [INFOGRAPHIC]",
        "https://www.bartonassociates.com/blog/get-the-facts-busting-locum-tenens-myths-infographic",
    ),
    (
        "The Truth About Medical Billing [INFOGRAPHIC]",
        "https://www.bartonassociates.com/blog/the-truth-about-medical-billing-infographic",
    ),
]
Or create a DataFrame from data:
df = pd.DataFrame(data, columns=["title", "link"])
print(df)
Prints:
   title                                                    link
0  Updated: Can an NP Do That? [INFOGRAPHIC]                https://www.bartonassociates.com/blog/updated-can-an-np-do-that-infographic
1  Locum Tenens for Dentists [INFOGRAPHIC]                  https://www.bartonassociates.com/blog/locum-tenens-for-dentists-infographic
2  Can a CRNA Do That? [INFOGRAPHIC]                        https://www.bartonassociates.com/blog/can-a-crna-do-that-infographic
3  Surviving the Physician Shortage [INFOGRAPHIC]           https://www.bartonassociates.com/blog/surviving-the-physician-shortage-infographic
4  Get the Facts: Busting Locum Tenens Myths [INFOGRAPHIC]  https://www.bartonassociates.com/blog/get-the-facts-busting-locum-tenens-myths-infographic
5  The Truth About Medical Billing [INFOGRAPHIC]            https://www.bartonassociates.com/blog/the-truth-about-medical-billing-infographic
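If the authors are needed in the same table as the titles and links, one option is to iterate over each post's container and read all three fields from it, so the values stay aligned row by row. The sketch below is only an illustration: the `div.post` container, the sample markup, and the helper name `collect_posts` are assumptions, since the real page structure is not shown here; only the `author` class comes from the question's script.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real page may group posts differently.
SAMPLE_HTML = """
<div class="post">
  <h4><a href="https://example.com/blog/a">Post A</a></h4>
  <span class="author">Jane Doe</span>
</div>
<div class="post">
  <h4><a href="https://example.com/blog/b">Post B</a></h4>
</div>
"""


def collect_posts(html):
    """Return one dict per post with its author, title and link."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for post in soup.select("div.post"):  # assumed container selector
        link = post.select_one("h4 a[href]")
        if link is None:
            # No linked heading in this container, nothing to record
            continue
        author = post.select_one(".author")
        rows.append(
            {
                "author": author.get_text(strip=True) if author else None,
                "title": link.get_text(strip=True),
                "link": link["href"],
            }
        )
    return rows


if __name__ == "__main__":
    for row in collect_posts(SAMPLE_HTML):
        print(row)
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)`, which produces the columns `author`, `title` and `link`.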
Comments:
1. Sorry, I did not mention this earlier. I want this to be appended to an empty list, as I do with the "data" list. What I am trying to create is basically a DataFrame of authors, links, and blog titles.