I am currently working on a scraper project where it is very important that EVERY request is properly handled, i.e., either an error is logged or a successful result is saved. I've already implemented the basic spider, and I can now process 99% of the requests successfully, but I can still get errors like captchas, 50x, 30x, or even too few fields in the result (in which case I'll try another website to find the missing fields).
At first, I thought it would be more "logical" to raise exceptions in the parsing callback and process them all in an errback; this could make the code more readable. But I found that an errback can only trap errors from the downloader module, such as non-200 response statuses. If I raise a self-implemented ParseError in the callback, the spider just raises it and stops.
Even if I have to process the parsing failure directly in the callback, I don't know how to retry the request immediately from within the callback in a clean fashion. You know, I may have to use a different proxy to send another request, or modify some request header.
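To make this concrete, here is roughly what I'd like to be able to write in the callback (parse_page, looks_like_captcha, the proxy address and the header value are just placeholders of mine, and I'm not sure whether dont_filter plus meta['proxy'] is actually the clean way to do it):

def parse_page(self, response):
    if self.looks_like_captcha(response):  # placeholder captcha detection
        # re-issue the same request through another proxy, with a tweaked header
        retry_req = response.request.replace(dont_filter=True)
        retry_req.meta['proxy'] = 'http://some-other-proxy:8080'    # placeholder
        retry_req.headers['User-Agent'] = 'some-other-user-agent'   # placeholder
        return retry_req
    # ...otherwise parse normally and return the items...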
I admit I'm relatively new to Scrapy, but I've tried back and forth for days and still cannot get this to work… I've checked every single question on SO and none matches. Thanks in advance for the help.
UPDATE: I realize this could be a very complex question, so I'll try to illustrate the scenario in the following pseudocode; I hope this helps:
from scraper.myexceptions import *

def parseRound1(self, response):
    .... some parsing routines ...
    if something wrong happened:
        # this causes the spider to raise a SpiderException and stop
        raise CaptchaError
    ...
    if not enough fields scraped:
        raise ParseError(task, "not enough fields")
    else:
        return items

def parseRound2(self, response):
    ...some other parsing routines...

def errHandler(self, failure):
    # how to trap all the exceptions?
    r = failure.trap()
    # cannot trap ParseError here
    if r == CaptchaError:
        # how to enqueue the original request here?
        retry
    elif r == ParseError:
        if raised from parseRound1:
            new request for Round2
        else:
            some other retry mechanism
    elif r == HTTPError:
        ignore or retry
EDIT 16 Nov 2012: Scrapy >= 0.16 uses a different method to attach methods to signals; an extra example has been added.
The simplest solution would be to write an extension in which you capture failures, using Scrapy signals. For example, the following extension will catch all errors and print a traceback.
You could do anything with the Failure (which is an instance of twisted.python.failure.Failure), like save it to your database or send an email.
For Scrapy versions prior to 0.16:
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

class FailLogger(object):

    def __init__(self):
        """
        Attach appropriate handlers to the signals
        """
        dispatcher.connect(self.spider_error, signal=signals.spider_error)

    def spider_error(self, failure, response, spider):
        print "Error on {0}, traceback: {1}".format(response.url, failure.getTraceback())
For Scrapy versions from 0.16 and up:
from scrapy import signals

class FailLogger(object):

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_error, signal=signals.spider_error)
        return ext

    def spider_error(self, failure, response, spider):
        print "Error on {0}, traceback: {1}".format(response.url, failure.getTraceback())
You would enable the extension in the settings, with something like:
EXTENSIONS = {
    'spiders.extensions.faillog.FailLogger': 599,
}
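Inside the spider_error handler you are not limited to printing; the failure can be inspected and handled differently per error type. As a rough sketch (CaptchaError and ParseError are the exception classes from the question, not something Scrapy provides):

from scraper.myexceptions import CaptchaError, ParseError

    def spider_error(self, failure, response, spider):
        # failure.check() returns the first matching exception class, or None
        if failure.check(CaptchaError):
            spider.log("Captcha hit on %s" % response.url)
        elif failure.check(ParseError):
            spider.log("Missing fields on %s" % response.url)
        else:
            spider.log("Unhandled error on %s:\n%s" % (response.url, failure.getTraceback()))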
At first, I thought it would be more "logical" to raise exceptions in the parsing callback and process them all in an errback; this could make the code more readable. But I found that an errback can only trap errors from the downloader module, such as non-200 response statuses. If I raise a self-implemented ParseError in the callback, the spider just raises it and stops.
Yes, you are right: callback and errback are meant to be used only with the downloader, since twisted is used to download a resource, and twisted uses deferreds, which is why the callbacks are needed. The only async part in Scrapy is usually the downloader; all the other parts work synchronously.
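For completeness, an errback is attached per request like below, and it will only ever see downloader-level failures (e.g. non-200 responses), never an exception raised inside your parse callback; the URL here is just a placeholder:

from scrapy.http import Request

def start_requests(self):
    yield Request('http://example.com/page',      # placeholder URL
                  callback=self.parseRound1,
                  errback=self.errHandler)        # fires only for download failures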
So, if you want to catch all non-downloader errors, you have to do it yourself:
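For instance, a big try/except around the parsing logic inside the callback, re-yielding the request on failure, would do it. A minimal sketch, in which the retry counter kept in request.meta, the limit of 3 attempts and the extract_fields/get_next_proxy helpers are my own assumptions:

def parseRound1(self, response):
    try:
        return self.extract_fields(response)          # hypothetical parsing helper
    except (CaptchaError, ParseError):
        retries = response.meta.get('parse_retries', 0)
        if retries < 3:                               # arbitrary retry limit
            retry_req = response.request.replace(dont_filter=True)
            retry_req.meta['parse_retries'] = retries + 1
            retry_req.meta['proxy'] = self.get_next_proxy()   # hypothetical helper
            return retry_req
        self.log("Giving up on %s" % response.url)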