Use requests to load a web page that requires cookies into a DataFrame in Python

#python-3.x #web-scraping #python-requests

Question:

I can't get the code below to navigate past the disclaimer page of the website. I think the problem is in how I am trying to collect the cookies.

I want to try using requests rather than Selenium.

    import requests
    import pandas as pd
    from pandas import read_html
    
    # open the page with the disclaimer just to get the cookies
    disclaimer = "https://umm.gassco.no/disclaimer"
    disclaimerdummy = requests.get(disclaimer)
    
    # open the actual page and use the cookies from the fake page opened before
    actualpage = "https://umm.gassco.no/disclaimer/acceptDisclaimer"
    actualpage2 = requests.get(actualpage, cookies=disclaimerdummy.cookies)
    
    # store the content of the actual page in text format
    actualpagetext = (actualpage2.text)
    
    # identify relevant data sources by looking at the 'msgTable' class in the webpage code
    # This is where the tables with the realtime data can be found
    gasscoflow = read_html(actualpagetext, attrs={"class": "msgTable"})
    
    # create the dataframes for the two relevant tables
    Table0 = pd.DataFrame(gasscoflow[0])
    Table1 = pd.DataFrame(gasscoflow[1])
    Table2 = pd.DataFrame(gasscoflow[2])
    Table3 = pd.DataFrame(gasscoflow[3])
    Table4 = pd.DataFrame(gasscoflow[4])

Answer #1:

After looking at the website: first, it has only 2 tables. Second, you can use a session so that cookies are carried across requests automatically, instead of storing them in a variable. The code below retrieves the expected data. It prints only the last 2 rows of each table because I used tail; you can change that to get whatever data you need from these tables.

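    from io import StringIO

    import pandas as pd
    import requests

    # A Session stores cookies from each response and sends them on later requests.
    s = requests.Session()
    # Visit the site first so the server can set its cookies on the session.
    s1 = s.get("https://umm.gassco.no")
    # Accept the disclaimer; the response is the page that holds the data tables.
    s2 = s.get("https://umm.gassco.no/disclaimer/acceptDisclaimer?")
    s2.raise_for_status()
    # Wrap the HTML in StringIO: newer pandas versions deprecate passing a literal string.
    data = pd.read_html(StringIO(s2.text), attrs={"class": "msgTable"})
    t0 = data[0]
    t1 = data[1]

    print(t0.tail(2))
    print(t1.tail(2))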

Output:

Let me know if you have any questions 🙂
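
To make the session behaviour more concrete, here is a minimal offline sketch of how a `requests.Session` attaches stored cookies to later requests for the same host. The cookie name `disclaimer_accepted` and its value are hypothetical (the real site chooses its own cookie names), and `prepare_request` is used only so that nothing is actually sent over the network.

```python
import requests

session = requests.Session()

# Hypothetical cookie, set by hand to stand in for what the server would send.
session.cookies.set("disclaimer_accepted", "1", domain="umm.gassco.no", path="/")

# prepare_request builds the outgoing request without sending it,
# so the headers can be inspected offline.
same_site = session.prepare_request(
    requests.Request("GET", "https://umm.gassco.no/disclaimer/acceptDisclaimer")
)
other_site = session.prepare_request(
    requests.Request("GET", "https://example.org/")
)

# The cookie is attached only to the request whose host matches its domain.
print(same_site.headers.get("Cookie"))
print(other_site.headers.get("Cookie"))
```

In real use the server sets the cookies itself during the first `s.get(...)`, and the session replays them on the second call without any `cookies=` argument.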

1. Thanks. There used to be a different site with more tables, but the new one, as you pointed out, has only two. The solution turned out to be much simpler than I expected. Thank you for your help.
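
Since the number of tables on the page changed between site versions, hard-coded indexes such as `gasscoflow[4]` raise an `IndexError` when fewer tables come back. One possible way to avoid that, sketched here under assumptions, is to collect whatever `read_html` returns into a dictionary. The helper name `name_tables` and the sample data are made up for illustration; the list of DataFrames stands in for the real `read_html` result.

```python
import pandas as pd


def name_tables(tables, prefix="table"):
    """Map each DataFrame in a list to a key like 'table0', 'table1', ..."""
    return {f"{prefix}{i}": df for i, df in enumerate(tables)}


# Stand-in for the list returned by read_html; the values are invented.
tables = [
    pd.DataFrame({"field": ["A", "B"], "flow": [1.0, 2.0]}),
    pd.DataFrame({"field": ["C"], "flow": [3.0]}),
]

named = name_tables(tables)

# Works for any number of tables, including zero.
for name, df in named.items():
    print(name, df.shape)
```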