#web-scraping #scrapy #scrapy-shell
Question:
I am trying to scrape an Amazon product page, but Scrapy gives me inconsistent results: sometimes it returns what I want, and sometimes it returns nothing. I have no idea why the same code produces different results. I wrote a loop that issues the same request 10 times, and the results differed between requests. Can anyone help me?
    import scrapy
    from scrapy import Request


    class AmzsingleSpider(scrapy.Spider):
        name = 'amzsingle'

        def start_requests(self):
            for i in range(10):
                yield Request(
                    url="https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929",
                    callback=self.parse,
                    dont_filter=True,
                )

        def parse(self, response):
            yield {
                'title': response.xpath('//span[@id="productTitle"]/text()').get()
            }
And this is the log I get in the terminal. In this run, 9 requests returned no title and 1 returned it (in another run, 7 returned a title and 3 did not):
    2021-11-27 22:08:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/robots.txt> (referer: None)
    2021-11-27 22:08:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
    2021-11-27 22:08:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
    {'title': None}
    2021-11-27 22:08:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
    2021-11-27 22:08:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
    {'title': None}
    2021-11-27 22:08:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
    2021-11-27 22:08:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
    {'title': None}
    2021-11-27 22:08:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
    2021-11-27 22:08:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
    {'title': None}
    2021-11-27 22:08:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
    2021-11-27 22:08:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
    {'title': None}
    2021-11-27 22:08:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
    2021-11-27 22:08:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
    {'title': None}
    2021-11-27 22:08:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
    2021-11-27 22:08:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
    {'title': None}
    2021-11-27 22:08:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
    2021-11-27 22:08:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
    {'title': None}
    2021-11-27 22:08:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
    2021-11-27 22:08:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
    {'title': '\n¡Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n'}
    2021-11-27 22:08:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
    2021-11-27 22:08:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
    {'title': None}
    2021-11-27 22:08:45 [scrapy.core.engine] INFO: Closing spider (finished)
    2021-11-27 22:08:45 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 4664,
     'downloader/request_count': 11,
     'downloader/request_method_count/GET': 11,
     'downloader/response_bytes': 1508328,
     'downloader/response_count': 11,
     'downloader/response_status_count/200': 11,
     'elapsed_time_seconds': 20.82323,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2021, 11, 27, 15, 8, 45, 324091),
     'httpcompression/response_bytes': 7323320,
     'httpcompression/response_count': 11,
     'item_scraped_count': 10,
     'log_count/DEBUG': 22,
     'log_count/INFO': 11,
     'memusage/max': 53161984,
     'memusage/startup': 53161984,
     'proxies/good': 1,
     'proxies/mean_backoff': 0.0,
     'proxies/reanimated': 0,
     'proxies/unchecked': 0,
     'response_received_count': 11,
     'robotstxt/request_count': 1,
     'robotstxt/response_count': 1,
     'robotstxt/response_status_count/200': 1,
     'scheduler/dequeued': 10,
     'scheduler/dequeued/memory': 10,
     'scheduler/enqueued': 10,
     'scheduler/enqueued/memory': 10,
     'start_time': datetime.datetime(2021, 11, 27, 15, 8, 24, 500861)}
    2021-11-27 22:08:45 [scrapy.core.engine] INFO: Spider closed (finished)
Comments:
1. Why are you using range? Why not put the loop variable i into the URL? If you did, the URL would be invalid, because this URL points to a single title, and for your selector the output would be correct. The URL does not lead to any further pages.
2. Using range was just to demonstrate that the same code returned different results.
Answer #1:
You can use a CSS selector.
    import scrapy
    from scrapy import Request


    class AmzsingleSpider(scrapy.Spider):
        name = 'amzsingle-parse'

        def start_requests(self):
            for i in range(10):
                yield Request(
                    url="https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929",
                    callback=self.parse,
                    dont_filter=True,
                )

        def parse(self, response):
            yield {
                'title': response.css('#productTitle ::text').get()
            }
Output
    {"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
    2021-11-27 15:56:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
    2021-11-27 15:56:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
    2021-11-27 15:56:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
    2021-11-27 15:56:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
    2021-11-27 15:56:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
    2021-11-27 15:56:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
    {"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
    2021-11-27 15:56:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
    {"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
    2021-11-27 15:56:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
    {"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
    2021-11-27 15:56:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
    {"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
    2021-11-27 15:56:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
    {"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
    2021-11-27 15:56:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
    2021-11-27 15:56:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
    {"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
    2021-11-27 15:56:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
    2021-11-27 15:56:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
    {"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
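A note on the value itself: a selector's `.get()` returns `None` when nothing matches, which is what the `{'title': None}` items in the question's log show, and when the title is found it arrives wrapped in newlines. Below is a minimal sketch of a helper that normalizes both cases. The name `clean_title` is hypothetical (it is not part of Scrapy), and the helper only tidies the extracted value; it does not explain why some responses lack the element.

```python
from typing import Optional


def clean_title(raw: Optional[str]) -> Optional[str]:
    """Normalize the value returned by a selector's .get() call.

    None (no match) is passed through unchanged; otherwise the
    surrounding whitespace is stripped.
    """
    if raw is None:
        return None
    stripped = raw.strip()
    # Treat a whitespace-only text node the same as a missing title.
    return stripped or None


# Values shaped like the ones in the logs above.
print(clean_title("\n¡Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"))
print(clean_title(None))
```

In the spider it could wrap the selector call, for example `clean_title(response.css('#productTitle ::text').get())`, so that items carry either a trimmed title or `None`.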