How to reschedule 403 HTTP status codes to be crawled later in scrapy?

According to these instructions, HTTP 500 errors, lost connections, and similar failures are always rescheduled, but I couldn't find anywhere whether 403 errors are rescheduled too, or whether they are simply treated as valid responses or ignored after reaching the retry limit.

Also from the same instruction:

Failed pages are collected on the scraping process and rescheduled at the end, once the spider has finished crawling all regular (non failed) pages. Once there are no more failed pages to retry, this middleware sends a signal (retry_complete), so other extensions could connect to that signal.

What do these failed pages refer to? Do they include 403 errors?

Also, I can see this message being logged when Scrapy encounters an HTTP 400 status:

2015-12-07 12:33:42 [scrapy] DEBUG: Ignoring response <400 http://example.com/q?x=12>: HTTP status code is not handled or not allowed

From this log message I think it's clear that HTTP 400 responses are ignored and not rescheduled.

I'm not sure whether an HTTP 403 status is ignored or rescheduled to be crawled at the end, so I tried rescheduling all responses with HTTP status 403 according to these docs. Here's what I have tried so far:

In a middlewares.py file:

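def process_response(self, request, response, spider):
    # Method of a downloader middleware class: returning a Request
    # reschedules it, returning the Response passes it along.
    if response.status == 403:
        return request
    return response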

In the settings.py:

RETRY_TIMES = 5
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 403, 404, 408]
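
A bounded variant of the process_response attempt above might look like the sketch below. It is only an illustration under stated assumptions: the class name Retry403Middleware, the meta key retry_403_count, and the max_retries parameter are hypothetical. It follows the same idea as the built-in retry middleware, which copies the request and sets dont_filter so that the scheduler's duplicate filter does not drop the copy. Unlike a bare "return request", it stops after a fixed number of attempts.

```python
class Retry403Middleware:
    """Downloader middleware sketch: reschedule 403 responses a bounded number of times."""

    # Hypothetical meta key used to count retries; not a Scrapy built-in.
    META_KEY = "retry_403_count"

    def __init__(self, max_retries=5):
        # Assumed cap, chosen to match RETRY_TIMES = 5 from the question.
        self.max_retries = max_retries

    def process_response(self, request, response, spider):
        # Anything other than 403 is passed along unchanged.
        if response.status != 403:
            return response

        attempts = request.meta.get(self.META_KEY, 0)
        if attempts >= self.max_retries:
            # Give up: hand the 403 response on instead of looping forever.
            return response

        # Copy the request and mark it so the duplicate filter lets it through.
        retry_request = request.copy()
        retry_request.dont_filter = True
        retry_request.meta[self.META_KEY] = attempts + 1
        return retry_request
```

Such a class would also have to be enabled through the DOWNLOADER_MIDDLEWARES setting. If 403 is already listed in RETRY_HTTP_CODES, the built-in retry middleware should cover the same case, so a custom one is likely unnecessary.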

My questions are:

  1. What do these failed pages refer to? Do they include 403 errors?
  2. Do I need to write process_response to reschedule 403 error pages, or are they automatically rescheduled by Scrapy?
  3. Which exceptions and HTTP status codes are rescheduled by Scrapy?
  4. If I reschedule a 404 error page, will I enter an infinite loop, or is there a timeout after which no further rescheduling is done?
Rahul asked Dec 07 '15 07:12



1 Answer

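  1. The statuses that are retried by default are listed under the RETRY_HTTP_CODES setting in the Scrapy documentation. 403 is not among the defaults.

  2. Adding 403 to RETRY_HTTP_CODES in settings.py should make Scrapy retry those responses, so a custom process_response is not needed.

  3. The status codes listed in RETRY_HTTP_CODES. The defaults are the ones referred to in point 1.

  4. RETRY_TIMES controls how many times a failed page is retried. It defaults to 2, and you can override it in settings.py. Once that limit is reached the page is no longer retried, so there is no infinite loop.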
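
To make point 4 concrete, here is a small, simplified model of the retry decision, written as plain Python rather than taken from Scrapy's source. The helper names should_retry and count_attempts are made up for illustration; the two settings values are the ones from the question.

```python
# Values the question puts in settings.py.
RETRY_TIMES = 5
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 403, 404, 408]


def should_retry(status, retries_so_far,
                 codes=RETRY_HTTP_CODES, max_retries=RETRY_TIMES):
    """Simplified model: retry only listed statuses, and only while under the cap."""
    return status in codes and retries_so_far < max_retries


def count_attempts(status):
    """Count total downloads of a page that always answers with `status`."""
    attempts = 1  # the initial request
    retries = 0
    while should_retry(status, retries):
        retries += 1
        attempts += 1
    return attempts
```

In this model a page that always returns 404 is downloaded six times (one initial request plus five retries) and then given up, while a status that is not in the list, such as 200, is downloaded once.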
eLRuLL answered Nov 15 '22 16:11
