I have a middleware that raises IgnoreRequest() if the URL contains "https".
class MiddlewareSkipHTTPS(object):
    def process_response(self, request, response, spider):
        if response.url.find("https") > -1:
            raise IgnoreRequest()
        else:
            return response
Is there a way to completely prevent Scrapy from performing a GET request to the HTTPS URL? I get the same values for response_bytes/response_count with and without IgnoreRequest() in my code snippet. I'm looking for zero values, i.e. skipping the URL entirely. I don't want Scrapy to download any bytes from the HTTPS page; it should just move on to the next URL.
Notes: this MUST be a middleware; I do not want to use rules embedded in the spider. I have hundreds of spiders and want to consolidate the logic.
Do not use process_response; it is called after the request has already been made. You need to use process_request instead:
def process_request(self, request, spider):
    request.url  # URL about to be requested
This method is called before the request is actually sent.
See here:
https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#scrapy.downloadermiddlewares.DownloaderMiddleware.process_request