I have a middleware that raises IgnoreRequest() if the URL contains "https".
class MiddlewareSkipHTTPS(object):
    def process_response(self, request, response, spider):
        if response.url.find("https") > -1:
            raise IgnoreRequest()
        else:
            return response
Is there a way to completely prevent Scrapy from performing a GET request to the HTTPS URL? I get the same values for response_bytes/response_count with and without IgnoreRequest() in my code snippet. I'm looking for zero values, i.e. skipping the URL entirely. I don't want Scrapy to download any bytes from the HTTPS page; it should just move on to the next URL.
Notes: this MUST be a middleware; I do not want to use rules embedded in the spider. I have hundreds of spiders and want to consolidate the logic.
Do not use process_response; it is called after the request has already been made. You need to use process_request instead:
def process_request(self, request, spider):
    request.url  # URL about to be requested
This method is called before the request is actually sent.
See here:
https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#scrapy.downloadermiddlewares.DownloaderMiddleware.process_request