Web-scraping error: spider stops reconnecting to the target page, but works again after a restart

#python #web-scraping #proxy #scrapy #http-proxy


Question:

I am scraping a website, and sometimes it prints the messages below and stops reconnecting to the target page:

 2020-08-18 22:37:30 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 0, unchecked: 0, reanimated: 0, mean backoff time: 0s)
2020-08-18 22:38:00 [scrapy.extensions.logstats] INFO: Crawled 116421 pages (at 35 pages/min), scraped 116421 items (at 35 items/min)
2020-08-18 22:38:00 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 0, unchecked: 0, reanimated: 0, mean backoff time: 0s)
2020-08-18 22:38:30 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 0, unchecked: 0, reanimated: 0, mean backoff time: 0s)
2020-08-18 22:39:00 [scrapy.extensions.logstats] INFO: Crawled 116421 pages (at 0 pages/min), scraped 116421 items (at 0 items/min)
2020-08-18 22:39:00 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 0, unchecked: 0, reanimated: 0, mean backoff time: 0s)
2020-08-18 22:39:30 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 0, unchecked: 0, reanimated: 0, mean backoff time: 0s)
2020-08-18 22:40:00 [scrapy.extensions.logstats] INFO: Crawled 116421 pages (at 0 pages/min), scraped 116421 items (at 0 items/min)
2020-08-18 22:40:00 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 0, unchecked: 0, reanimated: 0, mean backoff time: 0s)
2020-08-18 22:40:30 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 0, unchecked: 0, reanimated: 0, mean backoff time: 0s)
2020-08-18 22:41:00 [scrapy.extensions.logstats] INFO: Crawled 116421 pages (at 0 pages/min), scraped 116421 items (at 0 items/min)
2020-08-18 22:41:00 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 0, unchecked: 0, reanimated: 0, mean backoff time: 0s)
2020-08-18 22:41:30 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 0, unchecked: 0, reanimated: 0, mean backoff time: 0s)
2020-08-18 22:42:00 [scrapy.extensions.logstats] INFO: Crawled 116421 pages (at 0 pages/min), scraped 116421 items (at 0 items/min)
2020-08-18 22:42:00 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 0, unchecked: 0, reanimated: 0, mean backoff time: 0s)
2020-08-18 22:42:30 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 0, unchecked: 0, reanimated: 0, mean backoff time: 0s)
2020-08-18 22:43:00 [scrapy.extensions.logstats] INFO: Crawled 116421 pages (at 0 pages/min), scraped 116421 items (at 0 items/min)
2020-08-18 22:43:00 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 0, unchecked: 0, reanimated: 0, mean backoff time: 0s)
  

I am using a rotating proxy that refreshes every hour. I tried the same proxy with a different spider and it works fine on the same page.
What could be the problem? And how can I recover the data that has already been scraped?

Code:

 import scrapy

class Pool(scrapy.Spider):
    name = 'pool'
    # One URL per line in the links file
    start_urls = [l.strip() for l in open("D:links.txt").readlines()]

    def parse(self, response):
        # NOTE: the original expression "/html/[6]" is invalid XPath syntax;
        # "/html/*[6]" (the sixth child of <html>) is the likely intent
        pool1 = response.xpath("/html/*[6]").get('').strip()
        yield {
            'Pool1': pool1,
            'Url': response.url,
        }
  

Settings:

 BOT_NAME = 'Pool'

SPIDER_MODULES = ['Pool.spiders']
NEWSPIDER_MODULE = 'Pool.spiders'

ROBOTSTXT_OBEY = False
FEED_EXPORTERS = {
    'xlsx': 'scrapy_xlsx.XlsxItemExporter',
}
DOWNLOAD_TIMEOUT = 3600
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
COOKIES_ENABLED = False
ROTATING_PROXY_LIST = [
    'IPproxyhttp',
]
  

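One thing worth noting about the settings above (an editor's sketch, not part of the original post): `DOWNLOAD_TIMEOUT = 3600` means a single hung request can occupy a download slot for a full hour before it fails, which matches the "0 pages/min" stall in the log. A shorter timeout combined with Scrapy's built-in RetryMiddleware fails fast and re-schedules the request instead; the values below are illustrative assumptions, not recommendations from the post:

```python
# settings.py fragment -- illustrative values, not from the original post
DOWNLOAD_TIMEOUT = 60   # give up on a hung request after 60 s instead of 3600 s
RETRY_ENABLED = True    # RetryMiddleware is enabled by default in Scrapy
RETRY_TIMES = 5         # re-schedule each failed request up to 5 times
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
```

With these, a dead proxy or unresponsive page surfaces as a retried (and eventually failed) request within minutes rather than silently blocking the crawl.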
Answer #1:

I think the page, or all of the proxies, may have gone down at the same time, and the spider is waiting out DOWNLOAD_TIMEOUT.
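As for the second half of the question, recovering an interrupted crawl: this is not part of the original answer, but Scrapy's persistent job directory is the standard mechanism for pausing and resuming. A sketch, assuming the spider name `pool` from the question; `crawls/pool-1` is an arbitrary directory name chosen here:

```shell
# Persist the scheduler queue and dedupe state on disk so the crawl
# survives interruption. Reuse the same JOBDIR when restarting and
# Scrapy continues from the pending requests instead of starting over.
scrapy crawl pool -s JOBDIR=crawls/pool-1
```

Items already written to the output feed are kept either way; JOBDIR only saves you from re-crawling the 116421 pages that were already fetched.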