Parsing OpenURL data from Wikipedia pages

#python #parsing

Question:

I'm trying to get all the citation data from any given Wikipedia page. Looking at a Wikipedia page, I can see that most of the information I need is stored in an OpenURL object inside a span in the references section of the page.

The format of the span is shown below:

<span
    title="ctx_ver=Z39.88-2004&amp;
    rft_val_fmt=info:ofi/fmt:kev:mtx:journal&amp;
    rft.genre=unknown&amp;
    rft.jtitle=The Tennessean&amp;
    rft.atitle=Belmont University awarded final 2020 presidential debate&amp;
    rft.date=2019-10-11&amp;
    rft.aulast=Tamburin&amp;
    rft.aufirst=Adam&amp;
    rft_id=https://www.tennessean.com/story/news/2019/10/11/belmont-university-nashville-hosts-presidential-debate-2020/3941983002/&amp;
    rfr_id=info:sid/en.wikipedia.org:2020 United States presidential election"
    class="Z3988">
</span>

So far I've been able to extract all the spans with BeautifulSoup and pull out the title attributes that contain the data. However, I'm stuck when it comes to parsing the text in the title field. I'm particularly interested in rft.atitle, rft.date and rft_id.

import requests
from bs4 import BeautifulSoup


session = requests.Session()
targetWikiPage = "https://en.wikipedia.org/wiki/2020_Beirut_explosion"

if "wikipedia" in targetWikiPage:
    html = session.post(targetWikiPage)
    bsObj = BeautifulSoup(html.text, "html.parser")

    html = session.post(targetWikiPage)
    bsObj = BeautifulSoup(html.text, "html.parser")


    wikiReferences = bsObj.find_all('span', {'class': 'Z3988'})
    wikiReferencesBS = BeautifulSoup(str(wikiReferences), "html.parser")

    for span in wikiReferencesBS.find_all():
        title = span['title']
        print(title)

Partial solution

This solution provides a function that takes a string and two flags: the first marks the start of the substring we want to extract, and the second marks its end (the first occurrence of the end flag after the start).

The problem I'm running into now is an UnboundLocalError:

Traceback (most recent call last):
  File "coinscraper.py", line 33, in <module>
    print(extractstring(title,flag1='rft.atitle=', flag2='&'))
  File "coinscraper.py", line 17, in extractstring
    return(string)
UnboundLocalError: local variable 'string' referenced before assignment

Modification

import requests
from bs4 import BeautifulSoup
import re
import urllib.parse


session = requests.Session()
targetWikiPage = "https://en.wikipedia.org/wiki/2020_Beirut_explosion"


def extractstring(line,flag1, flag2):
    if flag1 in line: # $ is the flag
        dex1=line.index(flag1)
        subline=line[dex1 + len(flag1):-1] #leave out flag (+1) to end of line
        dex2=subline.index(flag2)
        string=subline[0:dex2].strip() #does not include last flag, strip whitespace
        string = urllib.parse.unquote_plus(string)

    return(string)

if "wikipedia" in targetWikiPage:
    html = session.post(targetWikiPage)
    bsObj = BeautifulSoup(html.text, "html.parser")

    html = session.post(targetWikiPage)
    bsObj = BeautifulSoup(html.text, "html.parser")


    wikiReferences = bsObj.find_all('span', {'class': 'Z3988'})
    wikiReferencesBS = BeautifulSoup(str(wikiReferences), "html.parser")

    for span in wikiReferencesBS.find_all():
        title = span['title']

        print(extractstring(title,flag1='rft.atitle=', flag2='&'))

Answer #1:

I would approach it like this:

from urllib.parse import unquote

import requests
from bs4 import BeautifulSoup

targetWikiPage = "https://en.wikipedia.org/wiki/2020_Beirut_explosion"

response = requests.get(targetWikiPage).text
# Every COinS citation lives in a <span class="Z3988"> element.
soup = BeautifulSoup(response, "html.parser").find_all('span', {'class': 'Z3988'})


def get_rfts():
    # BeautifulSoup has already decoded &amp; to &, so split on a plain "&".
    for i in soup:
        for rft in i['title'].split("&"):
            yield rft


keep = ["rft.atitle", "rft.date", "rft_id"]
for rft in get_rfts():
    rft_key, rft_value = rft.split("=")
    if rft_key in keep:
        # Percent-decode the value and turn "+" back into spaces.
        print(unquote(rft_value).replace("+", " "))

Output:

'Endemic corruption' caused Beirut blast, says Diab: Live updates
https://www.aljazeera.com/news/2020/08/beirut-police-fire-tear-gas-protesters-regroup-live-updates-200810010528285.html
Lebanon's government 'to resign over blast'
2020-08-10
https://www.bbc.com/news/world-middle-east-53720383
Beirut Explosion Generates Seismic Waves Equivalent Of A Magnitude 3.3 Earthquake
https://www.forbes.com/sites/davidbressan/2020/08/06/beirut-port-explosion-triggers-magnitude-3-earthquake/
Many injured as large blast rocks Beirut
2020-08-04
https://www.bbc.co.uk/news/world-middle-east-53656220
Beirut explosion 'one of the largest non-nuclear blasts in history'
2020-08-05
https://www.standard.co.uk/news/world/beirut-explosion-one-of-largest-blasts-history-a4517646.html
Second day of protests as anger over Beirut explosion grows: Live
https://www.aljazeera.com/news/2020/08/hundreds-protesters-injured-anger-simmers-beirut-live-200808234355971.html
Clashes Erupt in Beirut at Blast Protest as Lebanon's Anger Boils Over
...
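
As a possible alternative, the title string is essentially a URL query string, so the standard library's urllib.parse.parse_qs can likely do the splitting and decoding in one step. The sketch below assumes the title has already been entity-decoded by BeautifulSoup, so that the pairs are separated by a plain "&" and the values are percent-encoded with "+" for spaces. The names parse_coins and WANTED and the sample string are illustrative only; they do not come from the original code or from a live page.

```python
from urllib.parse import parse_qs

# Fields the question asks about.
WANTED = ("rft.atitle", "rft.date", "rft_id")


def parse_coins(title, wanted=WANTED):
    """Parse one COinS title string into a dict of the wanted fields.

    parse_qs splits on "&", percent-decodes the values and turns "+"
    into spaces. Keys that a citation does not carry map to None.
    """
    fields = parse_qs(title)
    # parse_qs returns a list of values per key; keep only the first one.
    return {key: fields.get(key, [None])[0] for key in wanted}


# Hypothetical sample in the encoded form assumed above.
sample = (
    "ctx_ver=Z39.88-2004"
    "&rft.genre=unknown"
    "&rft.atitle=Many+injured+as+large+blast+rocks+Beirut"
    "&rft.date=2020-08-04"
    "&rft_id=https%3A%2F%2Fwww.bbc.co.uk%2Fnews%2Fworld-middle-east-53656220"
)
print(parse_coins(sample))
```

Because missing keys come back as None, a citation without rft.atitle does not raise. That appears to be the case that triggers the UnboundLocalError in extractstring: string is only assigned inside the `if flag1 in line` branch, so the return statement fails for any span that lacks the flag.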
