
How do I catch errors with Scrapy so I can do something when I get a User Timeout error?

ERROR: Error downloading <GET URL_HERE>: User timeout caused connection failure.

I get this error every now and then when using my scraper. Is there a way I can catch it and run a function when it happens? I can't find how to do this anywhere online.

asked Jun 30 '15 by Ryan Weinstein

People also ask

How do you handle a Scrapy error?

What you can do is define an errback in your Request instances: errback (callable) – a function that will be called if any exception was raised while processing the request. This includes pages that failed with 404 HTTP errors and such. It receives a Twisted Failure instance as first parameter.
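
For example, here is a minimal sketch of wiring up an errback on a single request (the first answer below shows a complete version; the spider and handler names here are made up):

import scrapy
from twisted.internet.error import TimeoutError


class TimeoutAwareSpider(scrapy.Spider):
    name = "timeout_aware"

    def start_requests(self):
        # http://www.httpbin.org:12345/ is a non-responding host, so a timeout is expected
        yield scrapy.Request("http://www.httpbin.org:12345/",
                             callback=self.parse,
                             errback=self.on_error)

    def parse(self, response):
        self.logger.info("Got %s", response.url)

    def on_error(self, failure):
        # failure is a twisted.python.failure.Failure wrapping the original exception
        if failure.check(TimeoutError):
            self.logger.error("Timed out: %s", failure.request.url)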

How do you get a Scrapy response?

Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.
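
As a rough sketch of that flow (the site and CSS selectors below are assumptions for illustration), a spider yields Request objects and gets the matching Response objects back in its callbacks:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # the downloader hands the Response back to the callback of the Request that produced it
        for text in response.css("span.text::text").extract():
            yield {"text": text}
        # yielding another Request sends it back through the scheduler and downloader
        next_pages = response.css("li.next a::attr(href)").extract()
        if next_pages:
            yield scrapy.Request(response.urljoin(next_pages[0]), callback=self.parse)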

How do you stop a Scrapy spider?

In the latest version of Scrapy, available on GitHub, you can raise a CloseSpider exception to manually close a spider. It does force the spider to stop, but not immediately: requests that are already in flight may still run before the shutdown completes.
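
For reference, a minimal sketch of raising CloseSpider from a callback (the stop condition here, an item limit, is just an example):

import scrapy
from scrapy.exceptions import CloseSpider


class StoppableSpider(scrapy.Spider):
    name = "stoppable"
    start_urls = ["http://quotes.toscrape.com/"]
    item_count = 0

    def parse(self, response):
        for text in response.css("span.text::text").extract():
            self.item_count += 1
            if self.item_count > 50:
                # asks the engine to close the spider; requests already in flight
                # may still complete before the shutdown finishes
                raise CloseSpider("item limit reached")
            yield {"text": text}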


2 Answers

What you can do is define an errback in your Request instances:

errback (callable) – a function that will be called if any exception was raised while processing the request. This includes pages that failed with 404 HTTP errors and such. It receives a Twisted Failure instance as first parameter.

Here's some sample code (for scrapy 1.0) that you can use:

# -*- coding: utf-8 -*-
# errbacks.py
import scrapy

# from scrapy.contrib.spidermiddleware.httperror import HttpError
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError


class ErrbackSpider(scrapy.Spider):
    name = "errbacks"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                    errback=self.errback_httpbin,
                                    dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.error('Got successful response from {}'.format(response.url))
        # do something useful now

    def errback_httpbin(self, failure):
        # log all errback failures,
        # in case you want to do something special for some errors,
        # you may need the failure's type
        self.logger.error(repr(failure))

        #if isinstance(failure.value, HttpError):
        if failure.check(HttpError):
            # you can get the response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        #elif isinstance(failure.value, DNSLookupError):
        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        #elif isinstance(failure.value, TimeoutError):
        elif failure.check(TimeoutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)

And the output when running the spider with scrapy runspider (only 1 retry and a 5-second download timeout):

$ scrapy runspider errbacks.py --set DOWNLOAD_TIMEOUT=5 --set RETRY_TIMES=1
2015-06-30 23:45:55 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-30 23:45:55 [scrapy] INFO: Optional features available: ssl, http11
2015-06-30 23:45:55 [scrapy] INFO: Overridden settings: {'DOWNLOAD_TIMEOUT': '5', 'RETRY_TIMES': '1'}
2015-06-30 23:45:56 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-06-30 23:45:56 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-30 23:45:56 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-30 23:45:56 [scrapy] INFO: Enabled item pipelines: 
2015-06-30 23:45:56 [scrapy] INFO: Spider opened
2015-06-30 23:45:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-30 23:45:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-30 23:45:56 [scrapy] DEBUG: Retrying <GET http://www.httphttpbinbin.org/> (failed 1 times): DNS lookup failed: address 'www.httphttpbinbin.org' not found: [Errno -5] No address associated with hostname.
2015-06-30 23:45:56 [scrapy] DEBUG: Gave up retrying <GET http://www.httphttpbinbin.org/> (failed 2 times): DNS lookup failed: address 'www.httphttpbinbin.org' not found: [Errno -5] No address associated with hostname.
2015-06-30 23:45:56 [errbacks] ERROR: <twisted.python.failure.Failure <class 'twisted.internet.error.DNSLookupError'>>
2015-06-30 23:45:56 [errbacks] ERROR: DNSLookupError on http://www.httphttpbinbin.org/
2015-06-30 23:45:56 [scrapy] DEBUG: Crawled (200) <GET http://www.httpbin.org/> (referer: None)
2015-06-30 23:45:56 [scrapy] DEBUG: Crawled (404) <GET http://www.httpbin.org/status/404> (referer: None)
2015-06-30 23:45:56 [errbacks] ERROR: Got successful response from http://www.httpbin.org/
2015-06-30 23:45:56 [errbacks] ERROR: <twisted.python.failure.Failure <class 'scrapy.spidermiddlewares.httperror.HttpError'>>
2015-06-30 23:45:56 [errbacks] ERROR: HttpError on http://www.httpbin.org/status/404
2015-06-30 23:45:56 [scrapy] DEBUG: Retrying <GET http://www.httpbin.org/status/500> (failed 1 times): 500 Internal Server Error
2015-06-30 23:45:57 [scrapy] DEBUG: Gave up retrying <GET http://www.httpbin.org/status/500> (failed 2 times): 500 Internal Server Error
2015-06-30 23:45:57 [scrapy] DEBUG: Crawled (500) <GET http://www.httpbin.org/status/500> (referer: None)
2015-06-30 23:45:57 [errbacks] ERROR: <twisted.python.failure.Failure <class 'scrapy.spidermiddlewares.httperror.HttpError'>>
2015-06-30 23:45:57 [errbacks] ERROR: HttpError on http://www.httpbin.org/status/500
2015-06-30 23:46:01 [scrapy] DEBUG: Retrying <GET http://www.httpbin.org:12345/> (failed 1 times): User timeout caused connection failure.
2015-06-30 23:46:06 [scrapy] DEBUG: Gave up retrying <GET http://www.httpbin.org:12345/> (failed 2 times): User timeout caused connection failure.
2015-06-30 23:46:06 [errbacks] ERROR: <twisted.python.failure.Failure <class 'twisted.internet.error.TimeoutError'>>
2015-06-30 23:46:06 [errbacks] ERROR: TimeoutError on http://www.httpbin.org:12345/
2015-06-30 23:46:06 [scrapy] INFO: Closing spider (finished)
2015-06-30 23:46:06 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 4,
 'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2,
 'downloader/request_bytes': 1748,
 'downloader/request_count': 8,
 'downloader/request_method_count/GET': 8,
 'downloader/response_bytes': 12506,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'downloader/response_status_count/500': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 6, 30, 21, 46, 6, 537191),
 'log_count/DEBUG': 10,
 'log_count/ERROR': 9,
 'log_count/INFO': 7,
 'response_received_count': 3,
 'scheduler/dequeued': 8,
 'scheduler/dequeued/memory': 8,
 'scheduler/enqueued': 8,
 'scheduler/enqueued/memory': 8,
 'start_time': datetime.datetime(2015, 6, 30, 21, 45, 56, 322235)}
2015-06-30 23:46:06 [scrapy] INFO: Spider closed (finished)

Notice how scrapy logs the exceptions in its stats:

'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2,
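
If you want to act on those counters yourself, one option (a sketch, not part of the original answer) is to read them from the stats collector when the spider closes:

import scrapy


class StatsAwareSpider(scrapy.Spider):
    name = "stats_aware"
    start_urls = ["http://www.httpbin.org/"]

    def parse(self, response):
        pass

    def closed(self, reason):
        # closed() is called when the spider finishes; the stats collector
        # holds the same counters that are dumped at the end of the crawl
        timeouts = self.crawler.stats.get_value(
            'downloader/exception_type_count/twisted.internet.error.TimeoutError', 0)
        if timeouts:
            self.logger.warning("%d requests timed out during this crawl", timeouts)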
answered by paul trmbrth

I prefer to have a custom retry middleware like this:

# from scrapy.contrib.downloadermiddleware.retry import RetryMiddleware  # older Scrapy versions
from scrapy.downloadermiddlewares.retry import RetryMiddleware

from fake_useragent import FakeUserAgentError


class FakeUserAgentErrorRetryMiddleware(RetryMiddleware):

    def process_exception(self, request, exception, spider):
        # re-schedule the request when fake_useragent fails to provide a user agent
        if isinstance(exception, FakeUserAgentError):
            return self._retry(request, exception, spider)
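
For the middleware to take effect it also has to be enabled in settings.py; a sketch, assuming the class lives in myproject/middlewares.py:

DOWNLOADER_MIDDLEWARES = {
    # disable the built-in RetryMiddleware so retries go through the custom one
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'myproject.middlewares.FakeUserAgentErrorRetryMiddleware': 550,
}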
answered by Aminah Nuraini