
Downloader middleware to ignore all requests to a certain URL in Scrapy

I am trying to define a custom downloader middleware in Scrapy to ignore all requests to a particular URL (these requests are redirected from other URLs, so I can't filter them out when I generate the requests in the first place).

I have the following code. The idea is to catch this at the response-processing stage (I'm not sure exactly how a request that redirects to another request is handled), check the URL, and, if it matches the one I'm trying to filter out, return an IgnoreRequest exception. Otherwise, it returns the response as usual so that processing can continue.

from scrapy.exceptions import IgnoreRequest
from scrapy import log

class CustomDownloaderMiddleware:

    def process_response(request, response, spider):
        log.msg("In Middleware " + response.url, level=log.WARNING)
        if response.url == "http://www.achurchnearyou.com//":
            return IgnoreRequest()
        else:
            return response

and I add this to the dict of middlewares:

DOWNLOADER_MIDDLEWARES = {
    'acny.middlewares.CustomDownloaderMiddleware': 650
}

with a value of 650, which should - I think - make it run directly after the RedirectMiddleware.

However, when I run the crawler, I get an error saying:

ERROR: Error downloading <GET http://www.achurchnearyou.com/venue.php?V=00001>: process_response() got multiple values for keyword argument 'request'

This error is occurring on the very first page crawled, and I can't work out why it is occurring - I think I've followed what the manual said to do. What am I doing wrong?

asked Dec 09 '25 22:12 by robintw


1 Answer

I've found the solution to my own problem - it was a silly mistake with creating the class and method in Python. The code above needs to be:

from scrapy.exceptions import IgnoreRequest
from scrapy import log

class CustomDownloaderMiddleware(object):

    # self must come first; Scrapy supplies request, response and spider.
    def process_response(self, request, response, spider):
        log.msg("In Middleware " + response.url, level=log.WARNING)
        if response.url == "http://www.achurchnearyou.com//":
            # Raise (rather than return) the exception.
            raise IgnoreRequest()
        else:
            return response

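That is, the method needs self as its first parameter, and the class needs to inherit from object. Without self, Python binds the middleware instance to the request parameter, so the request that is also passed by keyword collides with it, which is what the error message reports. Note also that IgnoreRequest is now raised rather than returned.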
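The missing-self problem can be reproduced in plain Python, without Scrapy installed. The sketch below is only an illustration: BrokenMiddleware, FixedMiddleware and call_like_a_framework are hypothetical names, and the keyword-style call is an assumption based on the wording of the error message, not a copy of Scrapy's internals.

```python
class BrokenMiddleware:
    # No `self`: Python binds the instance to the first parameter, `request`.
    def process_response(request, response, spider):
        return response


class FixedMiddleware:
    # `self` receives the instance, so `request` is free for the caller.
    def process_response(self, request, response, spider):
        return response


def call_like_a_framework(middleware):
    """Hypothetical caller that passes every argument by keyword."""
    return middleware.process_response(
        request="req", response="resp", spider="spider"
    )


error_message = None
try:
    call_like_a_framework(BrokenMiddleware())
except TypeError as exc:
    # `request` was filled twice: once by the instance, once by keyword.
    error_message = str(exc)

fixed_result = call_like_a_framework(FixedMiddleware())

# The exact wording of the message differs between Python versions.
print(error_message)
print(fixed_result)
```

The broken class fails with a TypeError about multiple values for 'request', while the fixed class simply returns the response it was given.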

answered Dec 11 '25 14:12 by robintw


