2020-10-16 21:53:19 [scrapy.core.scraper] ERROR: Spider must return request, item, or None, got 'str' in

#python #python-3.x #scrapy #screen-scraping


Question:

I'm trying to extract some quotes from here using Scrapy, but I've run into a problem. Here is my code.

 import scrapy

start_urls = ['https://www.goodreads.com/quotes']
for number in range(1, 11):
    start_urls.append('https://www.goodreads.com/{}'.format(str(number)))

class quotes(scrapy.Spider):
    name = 'goodreads_quotes'

    def start_requests(self):
        urls = start_urls
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        quotes = response.css('div .quoteText::text').extract()
        for quote in quotes:
            if len(quote) > 10:
                yield quote
  

Every time I try to run it in the scrapy shell, I get the following error:

 2020-10-16 21:53:16 [scrapy.core.engine] INFO: Spider opened
2020-10-16 21:53:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 
items (at 0 items/min)
2020-10-16 21:53:16 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-16 21:53:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET 
https://www.goodreads.com/robots.txt> (referer: None)
2020-10-16 21:53:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.goodreads.com/quotes> 
(referer: None)
2020-10-16 21:53:19 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.goodreads.com/7> 
 (referer: None)
2020-10-16 21:53:19 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.goodreads.com/2> 
(referer: None)
2020-10-16 21:53:19 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.goodreads.com/5> 
(referer: None)
2020-10-16 21:53:19 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.goodreads.com/3> 
(referer: None)
2020-10-16 21:53:19 [scrapy.core.scraper] ERROR: Spider must return request, item, or None, got 'str' 
in <GET https://www.goodreads.com/quotes>
2020-10-16 21:53:19 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.goodreads.com/6> 
(referer: None)
2020-10-16 21:53:19 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.goodreads.com/4> 
(referer: None)
2020-10-16 21:53:19 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.goodreads.com/1> 
(referer: None)
2020-10-16 21:53:19 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 
https://www.goodreads.com/7>: HTTP status code is not handled or not allowed
2020-10-16 21:53:19 [scrapy.core.scraper] ERROR: Spider must return request, item, or None, got 'str' 
in <GET https://www.goodreads.com/quotes>
2020-10-16 21:53:19 [scrapy.core.scraper] ERROR: Spider must return request, item, or None, got 'str' 
in <GET https://www.goodreads.com/quotes>
2020-10-16 21:53:19 [scrapy.core.scraper] ERROR: Spider must return request, item, or None, got 'str' 
in <GET https://www.goodreads.com/quotes>
2020-10-16 21:53:19 [scrapy.core.scraper] ERROR: Spider must return request, item, or None, got 'str' 
in <GET https://www.goodreads.com/quotes>
2020-10-16 21:53:19 [scrapy.core.scraper] ERROR: Spider must return request, item, or None, got 'str' 
in <GET https://www.goodreads.com/quotes>
2020-10-16 21:53:19 [scrapy.core.scraper] ERROR: Spider must return request, item, or None, got 'str' 
in <GET https://www.goodreads.com/quotes>

2020-10-16 21:53:19 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 
https://www.goodreads.com/9>: HTTP status code is not handled or not allowed
2020-10-16 21:53:19 [scrapy.core.engine] INFO: Closing spider (finished)
  

Does anyone have any suggestions that could help me scrape the site successfully?

Answer #1:

As the error points out, the parse function must return a request, an item, or None. The error occurs because you are trying to return a str. Instead of returning a str, you can solve this by creating a class that inherits from scrapy.Item and holds the data you need:

 import scrapy

# Create a scrapy.Item class which will hold all the scraped data
class Quote(scrapy.Item):
    text = scrapy.Field()
    # any additional info you want to put in a quote...

class QuoteSpider(scrapy.Spider):
    ...

    def parse(self, response):
        quotes = response.css('div .quoteText::text').extract()
        for quote in quotes:
            if len(quote) > 10:
               # We return a Quote scrapy.Item instead of a string!
               yield Quote(text=quote)
  
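If you would rather not define an Item class, Scrapy also accepts plain dicts as items, so a minimal sketch of the same fix, keeping the selector from the question, could look like this:

 def parse(self, response):
    quotes = response.css('div .quoteText::text').extract()
    for quote in quotes:
        if len(quote) > 10:
            # a dict counts as an item, so this satisfies "request, item, or None"
            yield {'text': quote}

Either way, running the spider with scrapy crawl goodreads_quotes -o quotes.json should then export the yielded items instead of raising the error.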

Answer #2:

It looks like you forgot to define the fields. Go to the items.py file and paste the code written below inside the class:

  quotes = scrapy.Field()
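
For context, a minimal sketch of how the two pieces would fit together; the item class name QuotesScraperItem and the relative import path are assumptions that depend on how the project was generated with scrapy startproject:

 # items.py -- class name is an assumption; use whatever your generated project defines
import scrapy

class QuotesScraperItem(scrapy.Item):
    quotes = scrapy.Field()

# in the spider module, import the item and yield it instead of a bare string
from ..items import QuotesScraperItem

def parse(self, response):
    for quote in response.css('div .quoteText::text').extract():
        if len(quote) > 10:
            yield QuotesScraperItem(quotes=quote)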