 

how to process all kinds of exception in a scrapy project, in errback and callback?

Tags: python, scrapy

I am currently working on a scraper project where it is very important to ensure that EVERY request gets properly handled, i.e., either an error is logged or a successful result is saved. I've already implemented the basic spider, and I can now process 99% of the requests successfully, but I can still get errors like captchas, 50x or 30x responses, or not enough fields in the result (in which case I'll try another website to find the missing fields).

At first, I thought it was more "logical" to raise exceptions in the parsing callback and process them all in the errback; this could make the code more readable. But I tried it, only to find that the errback can only trap errors from the downloader module, such as non-200 response statuses. If I raise a self-implemented ParseError in the callback, the spider just raises it and stops.

Even if I have to handle parsing errors directly in the callback, I don't know how to cleanly retry the request immediately from within the callback. For example, I may have to use a different proxy for the new request, or modify some request header.

I admit I'm relatively new to Scrapy, but I've tried back and forth for days and still cannot get this working. I've checked every single question on SO and none of them match; thanks in advance for the help.

UPDATE: I realize this could be a very complex question, so I'll try to illustrate the scenario in the following pseudo code; I hope this helps:

from scraper.myexceptions import *

def parseRound1(self, response):

    .... some parsing routines ...
    if something went wrong:
       # this causes the spider to raise a SpiderException and stop
       raise CaptchaError
    ...

    if not enough fields scraped:
       raise ParseError(task, "not enough fields")
    else:
       return items

def parseRound2(self, response):
    ...some other parsing routines...

def errHandler(self, failure):
    # how to trap all the exceptions?
    r = failure.trap()
    # cannot trap ParseError here
    if r == CaptchaError:
       # how to enqueue the original request here?
       retry
    elif r == ParseError:
        if raised from parseRound1:
            new request for Round2
        else:
            some other retry mechanism
    elif r == HTTPError:
       ignore or retry
asked Jun 17 '12 by Shadow Lau


2 Answers

EDIT 16 Nov 2012: Scrapy >= 0.16 uses a different method to attach handlers to signals; an extra example has been added.

The simplest solution would be to write an extension in which you capture failures, using Scrapy signals. For example, the following extension will catch all errors and print a traceback.

You could do anything with the Failure - save it to your database, send an email, and so on - since it is an instance of twisted.python.failure.Failure.

For Scrapy versions up to 0.16:

from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

class FailLogger(object):
  def __init__(self):
    """ 
    Attach appropriate handlers to the signals
    """
    dispatcher.connect(self.spider_error, signal=signals.spider_error)

  def spider_error(self, failure, response, spider):
    print "Error on {0}, traceback: {1}".format(response.url, failure.getTraceback())

For Scrapy versions from 0.16 and up:

from scrapy import signals

class FailLogger(object):

  @classmethod
  def from_crawler(cls, crawler):
    ext = cls()

    crawler.signals.connect(ext.spider_error, signal=signals.spider_error)

    return ext

  def spider_error(self, failure, response, spider):
    print "Error on {0}, traceback: {1}".format(response.url, failure.getTraceback())  

You would enable the extension in the settings, with something like:

EXTENSIONS = {
    'spiders.extensions.faillog.FailLogger': 599,
}

answered by Sjaak Trekhaak


At first, I thought it was more "logical" to raise exceptions in the parsing callback and process them all in the errback; this could make the code more readable. But I tried it, only to find that the errback can only trap errors from the downloader module, such as non-200 response statuses. If I raise a self-implemented ParseError in the callback, the spider just raises it and stops.

Yes, you are right - callbacks and errbacks are meant to be used only with the downloader, since Twisted is used for downloading a resource, and Twisted uses deferreds - that's why callbacks are needed.

The only async part in Scrapy is usually the downloader; all the other parts work synchronously.
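
For reference (not part of the original answer), here is a hedged sketch of what the errback does catch, namely downloader-level failures. The URL is a placeholder, and the import paths, failure.check() cases and self.logger usage follow newer Scrapy releases; they may differ in 0.x:

from scrapy import Request, Spider
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TCPTimedOutError, TimeoutError


class ErrbackDemoSpider(Spider):
    name = "errback_demo"

    def start_requests(self):
        # The errback only fires for failures raised while fetching the page.
        yield Request(
            "http://www.example.com/round1",
            callback=self.parseRound1,
            errback=self.errHandler,
        )

    def parseRound1(self, response):
        yield {"url": response.url}

    def errHandler(self, failure):
        if failure.check(HttpError):
            # Non-200 responses are filtered by HttpErrorMiddleware and
            # arrive here wrapped in an HttpError.
            response = failure.value.response
            self.logger.error("HttpError on %s", response.url)
        elif failure.check(DNSLookupError, TimeoutError, TCPTimedOutError):
            self.logger.error("Network error on %s", failure.request.url)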

So, if you want to catch all non-downloader errors - do it yourself:

  • make a big try/except in the callback
  • or make a decorator for your callbacks which will do this (I like this idea more); see the sketch below
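
As a rough sketch of the decorator idea (not part of the original answer): the names and the retry strategy below are just one possible way to fill it in, assuming your callbacks yield items and/or requests and that you are on a reasonably recent Scrapy and Python 3.

import logging
from functools import wraps

import scrapy

logger = logging.getLogger(__name__)


def catch_parse_errors(callback):
    """Wrap a spider callback so parsing exceptions don't kill the crawl."""
    @wraps(callback)
    def wrapper(self, response, *args, **kwargs):
        try:
            # Iterate here so exceptions raised while parsing surface inside
            # the try block (callbacks are usually generators).
            for result in callback(self, response, *args, **kwargs) or []:
                yield result
        except Exception:
            logger.exception("Error while parsing %s", response.url)
            # One possible retry strategy: re-queue the same request and skip
            # the dupefilter; swap in a new proxy or headers here if needed.
            yield response.request.replace(dont_filter=True)
    return wrapper


class MySpider(scrapy.Spider):
    name = "myspider"

    @catch_parse_errors
    def parseRound1(self, response):
        # normal parsing; raising CaptchaError or ParseError here no longer
        # stops the spider
        yield {"url": response.url}

With something like this in place, the errHandler from the pseudo code is only needed for downloader-level failures, and everything raised while parsing goes through the decorator instead.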

answered by warvariuc