As per these instructions, I can see that HTTP 500 errors, connection-lost errors, etc. are always rescheduled, but I couldn't find anywhere whether 403 errors are rescheduled too, or whether they are simply treated as a valid response or ignored after reaching the retry limit.
Also, from the same instructions:
Failed pages are collected on the scraping process and rescheduled at the end, once the spider has finished crawling all regular (non failed) pages. Once there are no more failed pages to retry, this middleware sends a signal (retry_complete), so other extensions could connect to that signal.
What do these Failed Pages refer to? Do they include 403 errors?
Also, I can see this exception being raised when scrapy encounters a HTTP 400 status:
2015-12-07 12:33:42 [scrapy] DEBUG: Ignoring response <400 http://example.com/q?x=12>: HTTP status code is not handled or not allowed
From this exception I think it's clear that HTTP 400 responses are ignored and not rescheduled.
I'm not sure whether a 403 HTTP status is ignored or rescheduled to be crawled at the end. So I tried rescheduling all the responses that have HTTP status 403, according to these docs. Here's what I have tried so far:
In a middlewares.py file:
def process_response(self, request, response, spider):
    if response.status == 403:
        return request
    else:
        return response
In the settings.py:
RETRY_TIMES = 5
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 403, 404, 408]
My questions are:
1. What do these Failed Pages refer to? Do they include 403 errors?
2. Do I need to write my own process_response to reschedule 403 error pages, or are they automatically rescheduled by Scrapy?

You can find the default statuses to retry here.
Adding 403 to RETRY_HTTP_CODES in the settings.py file should be enough for those responses to be retried.
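As a minimal sketch of that change, the settings.py fragment below adds 403 to the retry list. The other codes and the RETRY_TIMES value are taken from the question rather than from Scrapy's defaults, so treat them as assumptions to adjust. Note that assigning RETRY_HTTP_CODES replaces the default list instead of extending it, so any default code you still want retried has to be listed explicitly.

```python
# settings.py (sketch; values other than 403 are assumptions copied from the question)

# Retrying is on by default; shown here only to make the dependency explicit.
RETRY_ENABLED = True

# Number of retries allowed per request, in addition to the first attempt.
RETRY_TIMES = 5

# Responses with these status codes are handed back to the scheduler for a retry.
# This assignment replaces the default list, so 403 is added alongside the others.
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 403]
```

With this in place, a custom process_response that returns the request for 403 responses should not be needed.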
The failed pages are the ones whose status codes are listed in RETRY_HTTP_CODES; the defaults are the ones linked above.
RETRY_TIMES controls how many times a failed page is retried. By default it is set to 2, and you can override it in the settings.py file.
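To make the interaction between the two settings concrete, here is a simplified model of the retry decision. It is a sketch and not Scrapy's actual code: should_retry and its parameters are hypothetical names, and the set of codes is an assumption based on the settings discussed above.

```python
def should_retry(status, retries_so_far, retry_http_codes, retry_times=2):
    """Simplified model: would a response with this status be retried once more?

    status            HTTP status code of the response
    retries_so_far    how many times this request has already been retried
    retry_http_codes  status codes considered "failed" (cf. RETRY_HTTP_CODES)
    retry_times       maximum number of retries (cf. RETRY_TIMES)
    """
    # Statuses outside the configured list are never retried.
    if status not in retry_http_codes:
        return False
    # Retry only while the retry budget has not been used up.
    return retries_so_far + 1 <= retry_times


# Hypothetical configuration: 403 has been added to the retry list.
codes = {500, 502, 503, 504, 408, 403}

print(should_retry(403, 0, codes))  # True: first failure, budget available
print(should_retry(403, 2, codes))  # False: both retries already used
print(should_retry(404, 0, codes))  # False: 404 is not in the list
```

Under this model, a retry limit of 2 means a page is requested at most three times: the original attempt plus two retries. Once the budget is exhausted, the response is passed on like any other non-retried response.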