#python-3.x #dom #beautifulsoup #html-parsing
Question:
I am using BeautifulSoup to parse some HTML pages. I want to get all the text inside the <p> tags under <div id="commentary">.
[image: screenshot of the HTML I want to extract]
When I use find_all to get all the <p> tags, the list contains only the first one. I used the code below to count the <p> tags under that <div>. In the image above you can clearly see that the highlighted <div> contains about 19 <p> tags, but my code still prints 1.
content = soup.find('div', attrs={'class': 'company-profile'})
points = content.find('div', attrs={'id': 'commentary'})
count = 0
for point in points.find_all('p'):
    count += 1
print(count)
print(points.text)
I don't know why this happens or why the find_all method does not return the full list.
I also tried using points.text to print all the text inside the <div id="commentary"> tag, but it prints only the contents of the first <p> tag.
(mlenv) chirag@debian10:~/ML/Finaments$ python main.py
<class 'bs4.element.Tag'>
State Bank of India is a Fortune 500 company. It is an Indian Multinational, Public Sector banking and financial services statutory body headquartered in Mumbai. It is the largest and oldest bank in India with over 200 years of history.#
1
1
Ratios (Q3FY21)
Capital Adequacy Ratio - 14.50%
Net Interest Margin - 3.34%
Gross NPA - 4.77%
Net NPA - 1.23%
CASA Ratio - 45.15%#
(mlenv) chirag@debian10:~/ML/Finaments$ ^C
(mlenv) chirag@debian10:~/ML/Finaments$
Those 1s come from print(count), and then print(points.text) prints only the contents of the first <p> tag.
I have only just started using BeautifulSoup, please help.
Comments:
1. Can you share the URL so this is easier to understand? Also, if you are looping over the p tags, you could print(point.text) and I think that would show them.
2. Here is the link: screener.in/company/sbin
Answer #1:
You can request the direct URL that serves this information. However, you will need to send the correct cookies and CSRF token with it:
import requests
from bs4 import BeautifulSoup

url = 'https://www.screener.in/wiki/company/3188/commentary/'
# Headers copied from the browser's developer tools; replace the token
# and cookie values with the ones from your own session.
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
    'referer': 'https://www.screener.in/company/SBIN/consolidated/',
    'x-csrftoken': 'E8zDjm7CtmSqCM2B9rTYPXTcPMJ22w2oynWzWzT4bCgAIaKkt4DmrirBSEPdCP0W',
    'cookie': '_gcl_au=1.1.69436223.1621345270; _ga=GA1.2.2056656539.1621345271; _gid=GA1.2.1452432592.1621345271; csrftoken=E8zDjm7CtmSqCM2B9rTYPXTcPMJ22w2oynWzWzT4bCgAIaKkt4DmrirBSEPdCP0W; sessionid=mrdcmrlqpe72dqjrqgtrb2m2v375sjv0; _gat_UA-2456523-7=1',
}
response = requests.post(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

count = 0
for point in soup.find_all('p'):
    count += 1
print(count)
print(soup.text)
Output:
19
Ratios (Q3FY21)
Capital Adequacy Ratio - 14.50%
Net Interest Margin - 3.34%
Gross NPA - 4.77%
Net NPA - 1.23%
CASA Ratio - 45.15%#
Branch Network
Presently, the bank operates a network of 22,330 branches and ~58,000 ATMs across India. It also operates ~71,000 business correspondent outlets across India.#
Market Share
The bank has a market share of 22.84% in deposits and 19.69% share in advances in India. It has a strong customer base of ~45 crore customers.#
Loan Book
Retail loans account for 39% of the loan book, followed by corporate (37%), SME (14%) and Agriculture (10%).#
Retail Book - Home loans account for 68% of the retail book, followed by xpress credit (22%), auto loans (9%), personal gold loans (2%) and others (9%).#
Exposure
The bank has a well-diversified loan book exposed to various sectors. Top sectors include home loans (23%), infrastructure (15%), services (12%) and agriculture (10%).
~75% of the corporate advances are rated A and better ratings from rating agencies. 38% of the corporate book accounts for PSUs & Govt. departments.#
Segmental NPAs
Presently, the total NPAs of the bank stands at 1,17,244 crores. agriculture segment accounts for the major ratio of NPAs i.e. 13.71% of all loans are NPA. Corporate segment accounts for 59,400 crores worth of NPAs i.e. 51% of total NPAs of the bank.#
International Business
The bank has a global footprint with a network of 233 branches/offices in 32 countries.# It has presence in USA, Canada, Brazil, Russia, Germany, France, Turkey, Australia, Bangladesh, Nepal, Sri Lanka and other countries.#
Presently, Overseas business accounts for 3% of total deposits# and 13% of total advances.#
Government Business
SBI has always been the banker of choice to the government of India and is the market leader in government business. It had turnover of ~52,50,000 lakh crores and commissions of ~3,700 crores from government business in FY20.#
Financial Inclusion Business
The bank has ~71,000 BC outlets which has primary focus on financial inclusion customers.# The bank accounts for 40% of all PMJDY accounts i.e. more than 12 crore accounts.# Presently, the deposits from PMJDY accounts are ~42,500 crores i.e. 1.2% of total deposits of the bank.
Digital Metrics
Increasing digitization resulted in ~40% of asset accounts and ~60% of liability customers added via digital channels in FY21.# 67% of all transactions were initiated through digital channels in 2020 which is up from 58% in the previous year.#
Subsidiaries Operations
The bank owns various subsidiaries which are engaged in related business activities :-
1. SBI Capital Markets Ltd (100% stake) - SBICAP is a leading investment banker, offering investment banking and corporate advisory services to clients across three product categories i.e. project advisory and structured finance, equity capital markets and debt capital markets.
This company further has wholly owned subsidiaries in related businesses viz. SBICAP Securities, SBICAP Trustee Co., SBICAP Ventures & others.#
2. SBI DHFI Ltd (72% stake) - It is a primary dealer and supports the book building process and provide depth and liquidity to secondary markets in G-Sec. It also deals in money market instruments, non G-Sec debt instruments, amongst others.#
3. SBI Cards and Payment Services Ltd (69% stake) - It is a non-banking financial company that offers extensive credit card portfolio to individual cardholders and corporate clients. It has diversified customer acquisition network that enables to engage prospective customers across multiple channels.#
The IPO of SBI Cards was launched in March 2020 wherein the company sold ~13 crore equity shares for a consideration of ₹10,350 crores.#
4. SBI Life Insurance Co. Ltd (57.6% stake) - It is one of the leading life insurance company in India which offers a wide range of individual and group insurance solutions that meet various life stage needs of customers.#
5. SBI Funds Management Pvt Ltd (63% stake) - It is a JV between SBI and AMUNDI (France). It is an asset management company with the fastest CAGR of 33% as against industrial average of 14% in the last 3 years.#
6. SBI General Insurance Company Ltd (70% stake) - It is a general insurance company which focuses on profitable growth in banc-assurance channel along with other distribution channels and line of businesses. It is first non-life insurance company in India to cross 6,000 crores in a decade of operations.#
Amalgamation of Associate Banks
In March 2017, the bank acquired its 5 associate state banks and Bharatiya Mahila Bank by allotting ~13.5 crore equity shares of SBI.#
Comments:
1. Thanks @chitown88, but why did you use a POST request? I thought we had to use a GET request and then use response.text to get the HTML document, which is what I passed to BeautifulSoup(). Also, where can I find these cookies and CSRF tokens?
2. You will find them in the developer tools (Ctrl-Shift-I) when you load the page (you may need to refresh the page) and look at the requests being made. It tells you the request method and parameters, the request headers, and everything else.
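In the answer's headers, the x-csrftoken value is the same string as the csrftoken entry inside the cookie header. Assuming that pairing is what the site expects, the headers can be built from the cookie string copied out of the developer tools instead of pasting the token twice. This is a minimal sketch: parse_cookie_header and build_headers are hypothetical helper names, and the cookie values are placeholders, not real credentials.

```python
def parse_cookie_header(cookie_header):
    """Split a 'name=value; name2=value2' string into a dict."""
    cookies = {}
    for pair in cookie_header.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep:  # skip fragments that have no '='
            cookies[name] = value
    return cookies


def build_headers(cookie_header, referer, user_agent):
    """Build request headers, reusing the csrftoken cookie as x-csrftoken."""
    cookies = parse_cookie_header(cookie_header)
    return {
        "user-agent": user_agent,
        "referer": referer,
        "x-csrftoken": cookies["csrftoken"],
        "cookie": cookie_header,
    }


# Placeholder values; copy the real cookie string from your own browser session.
cookie_header = "csrftoken=EXAMPLE_TOKEN; sessionid=EXAMPLE_SESSION"
headers = build_headers(
    cookie_header,
    referer="https://www.screener.in/company/SBIN/consolidated/",
    user_agent="Mozilla/5.0",
)
print(headers["x-csrftoken"])
```

The resulting dict can be passed as the headers argument of requests.post, as in the answer above.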