
How do I catch errors with Scrapy so I can do something when I get a User Timeout error?

ERROR: Error downloading <GET URL_HERE>: User timeout caused connection failure.

I get this error every now and then when using my scraper. Is there a way I can catch it and run a function when it happens? I can't find how to do this anywhere online.

asked Jun 30 '15 by Ryan Weinstein

People also ask

How do you handle a Scrapy error?

What you can do is define an errback in your Request instances: errback (callable) – a function that will be called if any exception was raised while processing the request. This includes pages that failed with 404 HTTP errors and such. It receives a Twisted Failure instance as first parameter.
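
For example, here is a minimal sketch of wiring up an errback on a single request (the first answer below shows a complete version; the spider and handler names here are made up):

import scrapy
from twisted.internet.error import TimeoutError


class TimeoutAwareSpider(scrapy.Spider):
    name = "timeout_aware"

    def start_requests(self):
        # http://www.httpbin.org:12345/ is a non-responding host, so a timeout is expected
        yield scrapy.Request("http://www.httpbin.org:12345/",
                             callback=self.parse,
                             errback=self.on_error)

    def parse(self, response):
        self.logger.info("Got %s", response.url)

    def on_error(self, failure):
        # failure is a twisted.python.failure.Failure wrapping the original exception
        if failure.check(TimeoutError):
            self.logger.error("Timed out: %s", failure.request.url)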

How do you get a Scrapy response?

Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.
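
As a rough sketch of that flow (the site and CSS selectors below are assumptions for illustration), a spider yields Request objects and gets the matching Response objects back in its callbacks:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # the downloader hands the Response back to the callback of the Request that produced it
        for text in response.css("span.text::text").extract():
            yield {"text": text}
        # yielding another Request sends it back through the scheduler and downloader
        next_pages = response.css("li.next a::attr(href)").extract()
        if next_pages:
            yield scrapy.Request(response.urljoin(next_pages[0]), callback=self.parse)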

How do you stop a Scrapy spider?

In the latest version of Scrapy, available on GitHub, you can raise a CloseSpider exception to manually close a spider. It does force the spider to stop, but not immediately: requests that are already in flight may still run before the shutdown completes.
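
For reference, a minimal sketch of raising CloseSpider from a callback (the stop condition here, an item limit, is just an example):

import scrapy
from scrapy.exceptions import CloseSpider


class StoppableSpider(scrapy.Spider):
    name = "stoppable"
    start_urls = ["http://quotes.toscrape.com/"]
    item_count = 0

    def parse(self, response):
        for text in response.css("span.text::text").extract():
            self.item_count += 1
            if self.item_count > 50:
                # asks the engine to close the spider; requests already in flight
                # may still complete before the shutdown finishes
                raise CloseSpider("item limit reached")
            yield {"text": text}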


2 Answers

What you can do is define an errback in your Request instances:

errback (callable) – a function that will be called if any exception was raised while processing the request. This includes pages that failed with 404 HTTP errors and such. It receives a Twisted Failure instance as first parameter.

Here's some sample code (for scrapy 1.0) that you can use:

# -*- coding: utf-8 -*-
# errbacks.py
import scrapy

# from scrapy.contrib.spidermiddleware.httperror import HttpError
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError


class ErrbackSpider(scrapy.Spider):
    name = "errbacks"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                    errback=self.errback_httpbin,
                                    dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.error('Got successful response from {}'.format(response.url))
        # do something useful now

    def errback_httpbin(self, failure):
        # log all errback failures,
        # in case you want to do something special for some errors,
        # you may need the failure's type
        self.logger.error(repr(failure))

        #if isinstance(failure.value, HttpError):
        if failure.check(HttpError):
            # you can get the response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        #elif isinstance(failure.value, DNSLookupError):
        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        #elif isinstance(failure.value, TimeoutError):
        elif failure.check(TimeoutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)

And the output when running the spider with scrapy runspider (only 1 retry and a 5-second download timeout):

$ scrapy runspider errbacks.py --set DOWNLOAD_TIMEOUT=5 --set RETRY_TIMES=1
2015-06-30 23:45:55 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-30 23:45:55 [scrapy] INFO: Optional features available: ssl, http11
2015-06-30 23:45:55 [scrapy] INFO: Overridden settings: {'DOWNLOAD_TIMEOUT': '5', 'RETRY_TIMES': '1'}
2015-06-30 23:45:56 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-06-30 23:45:56 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-30 23:45:56 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-30 23:45:56 [scrapy] INFO: Enabled item pipelines: 
2015-06-30 23:45:56 [scrapy] INFO: Spider opened
2015-06-30 23:45:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-30 23:45:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-30 23:45:56 [scrapy] DEBUG: Retrying <GET http://www.httphttpbinbin.org/> (failed 1 times): DNS lookup failed: address 'www.httphttpbinbin.org' not found: [Errno -5] No address associated with hostname.
2015-06-30 23:45:56 [scrapy] DEBUG: Gave up retrying <GET http://www.httphttpbinbin.org/> (failed 2 times): DNS lookup failed: address 'www.httphttpbinbin.org' not found: [Errno -5] No address associated with hostname.
2015-06-30 23:45:56 [errbacks] ERROR: <twisted.python.failure.Failure <class 'twisted.internet.error.DNSLookupError'>>
2015-06-30 23:45:56 [errbacks] ERROR: DNSLookupError on http://www.httphttpbinbin.org/
2015-06-30 23:45:56 [scrapy] DEBUG: Crawled (200) <GET http://www.httpbin.org/> (referer: None)
2015-06-30 23:45:56 [scrapy] DEBUG: Crawled (404) <GET http://www.httpbin.org/status/404> (referer: None)
2015-06-30 23:45:56 [errbacks] ERROR: Got successful response from http://www.httpbin.org/
2015-06-30 23:45:56 [errbacks] ERROR: <twisted.python.failure.Failure <class 'scrapy.spidermiddlewares.httperror.HttpError'>>
2015-06-30 23:45:56 [errbacks] ERROR: HttpError on http://www.httpbin.org/status/404
2015-06-30 23:45:56 [scrapy] DEBUG: Retrying <GET http://www.httpbin.org/status/500> (failed 1 times): 500 Internal Server Error
2015-06-30 23:45:57 [scrapy] DEBUG: Gave up retrying <GET http://www.httpbin.org/status/500> (failed 2 times): 500 Internal Server Error
2015-06-30 23:45:57 [scrapy] DEBUG: Crawled (500) <GET http://www.httpbin.org/status/500> (referer: None)
2015-06-30 23:45:57 [errbacks] ERROR: <twisted.python.failure.Failure <class 'scrapy.spidermiddlewares.httperror.HttpError'>>
2015-06-30 23:45:57 [errbacks] ERROR: HttpError on http://www.httpbin.org/status/500
2015-06-30 23:46:01 [scrapy] DEBUG: Retrying <GET http://www.httpbin.org:12345/> (failed 1 times): User timeout caused connection failure.
2015-06-30 23:46:06 [scrapy] DEBUG: Gave up retrying <GET http://www.httpbin.org:12345/> (failed 2 times): User timeout caused connection failure.
2015-06-30 23:46:06 [errbacks] ERROR: <twisted.python.failure.Failure <class 'twisted.internet.error.TimeoutError'>>
2015-06-30 23:46:06 [errbacks] ERROR: TimeoutError on http://www.httpbin.org:12345/
2015-06-30 23:46:06 [scrapy] INFO: Closing spider (finished)
2015-06-30 23:46:06 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 4,
 'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2,
 'downloader/request_bytes': 1748,
 'downloader/request_count': 8,
 'downloader/request_method_count/GET': 8,
 'downloader/response_bytes': 12506,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'downloader/response_status_count/500': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 6, 30, 21, 46, 6, 537191),
 'log_count/DEBUG': 10,
 'log_count/ERROR': 9,
 'log_count/INFO': 7,
 'response_received_count': 3,
 'scheduler/dequeued': 8,
 'scheduler/dequeued/memory': 8,
 'scheduler/enqueued': 8,
 'scheduler/enqueued/memory': 8,
 'start_time': datetime.datetime(2015, 6, 30, 21, 45, 56, 322235)}
2015-06-30 23:46:06 [scrapy] INFO: Spider closed (finished)

Notice how scrapy logs the exceptions in its stats:

'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2,
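
If you want to act on those counters yourself, one option (a sketch, not part of the original answer) is to read them from the stats collector when the spider closes:

import scrapy


class StatsAwareSpider(scrapy.Spider):
    name = "stats_aware"
    start_urls = ["http://www.httpbin.org/"]

    def parse(self, response):
        pass

    def closed(self, reason):
        # closed() is called when the spider finishes; the stats collector
        # holds the same counters that are dumped at the end of the crawl
        timeouts = self.crawler.stats.get_value(
            'downloader/exception_type_count/twisted.internet.error.TimeoutError', 0)
        if timeouts:
            self.logger.warning("%d requests timed out during this crawl", timeouts)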
answered by paul trmbrth

I prefer to have a custom retry middleware like this:

# from scrapy.contrib.downloadermiddleware.retry import RetryMiddleware  # older Scrapy versions
from scrapy.downloadermiddlewares.retry import RetryMiddleware

from fake_useragent import FakeUserAgentError


class FakeUserAgentErrorRetryMiddleware(RetryMiddleware):

    def process_exception(self, request, exception, spider):
        # re-schedule the request when fake_useragent fails to provide a user agent
        if isinstance(exception, FakeUserAgentError):
            return self._retry(request, exception, spider)
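
For the middleware to take effect it also has to be enabled in settings.py; a sketch, assuming the class lives in myproject/middlewares.py:

DOWNLOADER_MIDDLEWARES = {
    # disable the built-in RetryMiddleware so retries go through the custom one
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'myproject.middlewares.FakeUserAgentErrorRetryMiddleware': 550,
}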
answered by Aminah Nuraini