Scrapy Middleware to ignore URL and prevent crawling

Tags:

python

scrapy

I have a middleware that raises IgnoreRequest() if the URL contains "https".

from scrapy.exceptions import IgnoreRequest

class MiddlewareSkipHTTPS(object):
    def process_response(self, request, response, spider):
        # Drop the response if the URL contains "https"
        if response.url.find("https") > -1:
            raise IgnoreRequest()
        else:
            return response

Is there a way to completely prevent Scrapy from performing a GET request to the HTTPS URL? I get the same values for response_bytes/response_count with and without my snippet; I'm looking for zero values, i.e. the URL is skipped entirely. I don't want Scrapy to download any bytes from the HTTPS page, just move on to the next URL.

Note: this MUST be a middleware; I do not want to use rules embedded in the spider. I have hundreds of spiders and want to consolidate the logic.

asked Apr 03 '17 by invulnarable27

1 Answer

Do not use process_response; it is called after the request has already been made.

You need to use process_request instead:

def process_request(self, request, spider):
    request.url  # URL about to be requested

This method is called before the request is actually made; returning None lets the request proceed, while raising IgnoreRequest drops it without downloading anything.
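A minimal sketch of the middleware rewritten around process_request, carrying over the class name and the "https" check from the question:

from scrapy.exceptions import IgnoreRequest

class MiddlewareSkipHTTPS(object):
    def process_request(self, request, spider):
        # Raising IgnoreRequest here aborts the request before it is
        # ever sent, so no bytes are downloaded for the HTTPS URL.
        if request.url.startswith("https://"):
            raise IgnoreRequest()
        # Returning None lets all other requests continue through the
        # remaining middlewares and get downloaded as usual.
        return None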

See the documentation for process_request:

https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#scrapy.downloadermiddlewares.DownloaderMiddleware.process_request
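To consolidate the logic across hundreds of spiders, enable the middleware once in the project settings. The module path below is a placeholder for wherever you keep the class, and 543 is just an example priority:

# settings.py (shared by every spider in the project)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MiddlewareSkipHTTPS': 543,
}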

answered Oct 08 '22 by Umair Ayub