Extracting text from a tag in an HTML document

#python-3.x #beautifulsoup #scrapy #text-extraction #data-extraction

Question:

I am trying to extract text from these documents (i.e. doc1, doc2).

I only need the text under the Item 1 heading.

What I have tried so far is shown below:

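import html2text
from bs4 import BeautifulSoup as BS

# `response` is the Scrapy response for the filing page
soup = BS(response.text, 'html.parser')

# Attributes of the first <a> found next to the Item 1 (start) and Item 2 (end) labels
startid = BS(response.css('tr:contains("Item\xa01"), tr:contains("Item 1."), *:contains("ITEM 1")')[0].css('a').get(''), 'html.parser').find('a').attrs

endid = BS(response.css('tr:contains("Item\xa02"), tr:contains("Item 2."), *:contains("ITEM 2")')[0].css('a').get(''), 'html.parser').find('a').attrs

# Look the two anchors up once, then accumulate every sibling between them
start = soup.find('a', attrs=startid)
end = soup.find('a', attrs=endid)

html = ''
for tag in start.parent.next_siblings:
    if tag == end.parent:
        break
    html += str(tag)

# Convert the collected HTML fragment to plain text
h = html2text.HTML2Text()
h.ignore_links = True
print(h.handle(html))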
 

I only need the text under the Item 1 section.

Answer #1:

If you run:

import requests

r = requests.get('https://www.sec.gov/Archives/edgar/data/0000001800/000104746915001377/a2222655z10-k.htm')
print(r.text[1532:(1532 + 571)])
 

The output is:

To allow for equitable access to all users, SEC reserves the right to limit requests originating from undeclared automated tools. Your request has been identified as part of a network of automated tools outside of the acceptable policy and will be managed until action is taken to declare your traffic.</p>\n\n<p>Please declare your traffic by updating your user agent to include company specific information.</p>\n\n\n<p>For best practices on efficiently downloading information from SEC.gov, including the latest EDGAR filings, visit <a href="https://www.sec.gov/developer" '
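
The message asks the client to declare itself through the user agent. A minimal sketch of doing that is shown here. It uses only the standard library so it runs without extra packages, and the `USER_AGENT` value and the helper names `build_request` and `fetch` are placeholders introduced for illustration, not values prescribed by the SEC.

```python
import urllib.request

FILING_URL = ("https://www.sec.gov/Archives/edgar/data/0000001800/"
              "000104746915001377/a2222655z10-k.htm")

# Placeholder identity (an assumption): replace with your own organisation
# name and a contact address.
USER_AGENT = "Example Company admin@example.com"


def build_request(url, user_agent):
    """Return a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent, timeout=30):
    """Download `url` and return the decoded body (performs network I/O)."""
    with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# Building the request does not touch the network; fetch() does.
request = build_request(FILING_URL, USER_AGENT)
```

With `requests`, the same header can be passed as `requests.get(url, headers={"User-Agent": USER_AGENT})`.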
 

If you look at https://www.sec.gov/developer, it links to https://www.sec.gov/edgar/sec-api-documentation.

So for 0000001800 you should try https://data.sec.gov/submissions/CIK0000001800.json, which contains…

 {"cik":"1800","entityType":"operating","sic":"2834
","sicDescription":"Pharmaceutical Preparations","
insiderTransactionForOwnerExists":1,"insiderTransa
ctionForIssuerExists":1,"name":"ABBOTT LABORATORIE
S","tickers":["ABT"],"exchanges":["NYSE"],"ein":"3
60698440","description":"","website":"","investorW
ebsite":"","category":"Large accelerated filer","f
iscalYearEnd":"1231","stateOfIncorporation":"IL","
stateOfIncorporationDescription":"IL","addresses":
{"mailing":{"street1":"100 ABBOTT PARK ROAD","stre
et2":null,"city":"ABBOTT PARK","stateOrCountry":"I
L","zipCode":"60064-3500","stateOrCountryDescripti
on":"IL"},"business":{"street1":"100 ABBOTT PARK R
OAD","street2":null,"city":"ABBOTT PARK","stateOrC
ountry":"IL","zipCode":"60064-3500","stateOrCountr
yDescription":"IL"}},"phone":"2246676100","flags":
"","formerNames":[],"filings":{"recent":{"accessio
nNumber":["0001415889-21-004019","0001415889-21-00
4018","0001415889-21-003917","0001415889-21-003804
","0001104659-21-100055","0001415889-21-003773","0
001415889-21-003748","0001104659-21-094680","00014
15889-21-003516","0001415889-21-003514","000141588
9-21-003513","0001415889-21-003512","0001415889-21
-003509","0001415889-21-003503","0001415889-21-003
428","0001415889-21-003425","0001415889-21-003423"
,"0001415889-21-003418","0001104659-21-086325","00
01415889-21-002958","0001415889-21-002831","000141
5889-21-002830","0001104659-21-0763........
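
The filing URL requested earlier has the shape CIK folder, accession number without dashes, document name, so the JSON can in principle be mapped back to document links. The sketch below is one way to do that, under assumptions: the field names `form` and `primaryDocument` are not visible in the truncated excerpt above and are assumed to sit next to `accessionNumber` as parallel arrays. `filing_urls` is a hypothetical helper, and `sample` is a hand-built stand-in for the real response, not actual API output.

```python
ARCHIVE_ROOT = "https://www.sec.gov/Archives/edgar/data"


def filing_urls(submissions, form_type="10-K"):
    """Return document URLs for every listed filing of the given form type."""
    recent = submissions["filings"]["recent"]
    cik = submissions["cik"].zfill(10)  # "1800" -> "0000001800"
    urls = []
    # The arrays are assumed to be parallel: index i describes one filing.
    for accession, form, document in zip(
        recent["accessionNumber"], recent["form"], recent["primaryDocument"]
    ):
        if form == form_type:
            folder = accession.replace("-", "")
            urls.append(f"{ARCHIVE_ROOT}/{cik}/{folder}/{document}")
    return urls


# Hand-built stand-in for the parsed JSON (illustrative values only).
sample = {
    "cik": "1800",
    "filings": {"recent": {
        "accessionNumber": ["0000000000-00-000000", "0001047469-15-001377"],
        "form": ["4", "10-K"],
        "primaryDocument": ["placeholder.xml", "a2222655z10-k.htm"],
    }},
}

ten_k_links = filing_urls(sample)
```

For real data, the dictionary would come from `json.loads(...)` applied to the body of the submissions URL.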
 

Comments:

1. I am using the same link to access the 10-K documents. I just want to extract specific data from that document. I appreciate your answer, but that is not what I asked in the question.
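
For the extraction the comment asks about, one possible approach is sketched below. It follows the idea from the question (find the table-of-contents links for Item 1 and Item 2, then collect what lies between their targets), but swaps BeautifulSoup for the standard-library `html.parser` so the example runs without extra packages. The class names, `extract_item1`, and the `SAMPLE` document are invented for illustration; real filings vary in markup, so this is a sketch under those assumptions rather than a general solution.

```python
import re
from html.parser import HTMLParser


class TocLinkParser(HTMLParser):
    """Collect (fragment, link text) for every <a href="#..."> in the page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # fragment of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href and href.startswith("#"):
            self._href, self._text = href[1:], []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Normalise whitespace, including non-breaking spaces.
            self.links.append((self._href, " ".join("".join(self._text).split())))
            self._href = None


class BetweenAnchorsParser(HTMLParser):
    """Collect text after the element named `start` and before the one named `end`."""

    def __init__(self, start, end):
        super().__init__()
        self.start, self.end = start, end
        self.collecting = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        target = attrs.get("name") or attrs.get("id")
        if target == self.start:
            self.collecting = True
        elif target == self.end:
            self.collecting = False

    def handle_data(self, data):
        if self.collecting and data.strip():
            self.chunks.append(" ".join(data.split()))


def find_fragment(links, pattern):
    """Return the fragment of the first link whose text matches `pattern`."""
    for fragment, text in links:
        if re.match(pattern, text, flags=re.IGNORECASE):
            return fragment
    return None


def extract_item1(html):
    """Return the text between the Item 1 and Item 2 anchors, or '' if not found."""
    toc = TocLinkParser()
    toc.feed(html)
    start = find_fragment(toc.links, r"item\s*1\b")  # \b skips "Item 1A" and "Item 10"
    end = find_fragment(toc.links, r"item\s*2\b")
    if start is None or end is None:
        return ""
    body = BetweenAnchorsParser(start, end)
    body.feed(html)
    return "\n".join(body.chunks)


# Invented miniature document: a table of contents followed by the sections.
SAMPLE = """
<table>
  <tr><td><a href="#i1">Item&nbsp;1.</a></td><td>Business</td></tr>
  <tr><td><a href="#i1a">Item 1A.</a></td><td>Risk Factors</td></tr>
  <tr><td><a href="#i2">Item 2.</a></td><td>Properties</td></tr>
</table>
<p><a name="i1"></a>ITEM 1. BUSINESS</p>
<p>We make widgets.</p>
<p><a name="i1a"></a>ITEM 1A. RISK FACTORS</p>
<p>Widgets may break.</p>
<p><a name="i2"></a>ITEM 2. PROPERTIES</p>
<p>We own a factory.</p>
"""

item1_text = extract_item1(SAMPLE)
```

As in the question, the end marker is Item 2, so Item 1A is included in the result; stopping at Item 1A instead would only need a different `end` pattern.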