Scrapy: if a request fails (e.g. 404, 500), how can I make an alternative request?

I have a problem with Scrapy. If a request fails (e.g. 404, 500), how can I make an alternative request? For example, two links both provide the price info; if one fails, the spider should request the other automatically.

asked Jun 04 '13 by Zhang Jiuzhou



2 Answers

Use "errback" in the Request like errback=self.error_handler where error_handler is a function (just like callback function) in this function check the error code and make the alternative Request.

see errback in the scrapy documentation: http://doc.scrapy.org/en/latest/topics/request-response.html
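
A minimal sketch of that pattern (the price URLs and method names are placeholders, and it uses the newer scrapy.Spider API; by default the HttpError spider middleware turns non-2xx responses into an HttpError failure, which is what triggers the errback):

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError


class PriceSpider(scrapy.Spider):
    name = "price"

    def start_requests(self):
        # primary source; if it fails (404, 500, timeout, ...) on_error runs
        yield scrapy.Request(
            "http://example.com/price-main",        # hypothetical primary URL
            callback=self.parse_price,
            errback=self.on_error,
        )

    def parse_price(self, response):
        # extract the price from whichever page actually responded
        yield {"price": response.css("span.price::text").get()}

    def on_error(self, failure):
        # on an HTTP error, fall back to the alternative link
        if failure.check(HttpError):
            yield scrapy.Request(
                "http://example.com/price-backup",  # hypothetical fallback URL
                callback=self.parse_price,
            )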

answered by Omair Shamshir


Just set handle_httpstatus_list = [404, 500] and check the status code in the parse method. Here's an example:

from scrapy.http import Request
from scrapy.spider import BaseSpider  # in current Scrapy this is scrapy.Spider


class MySpider(BaseSpider):
    # let these status codes through to parse() instead of being filtered out
    handle_httpstatus_list = [404, 500]
    name = "my_crawler"

    start_urls = ["http://github.com/illegal_username"]

    def parse(self, response):
        if response.status in self.handle_httpstatus_list:
            # the first URL failed, so request the alternative one
            return Request(url="https://github.com/kennethreitz/", callback=self.after_404)

    def after_404(self, response):
        print(response.url)

        # parse the page and extract items

Also see:

  • How to get the scrapy failure URLs?
  • Scrapy and response status code: how to check against it?
  • How to retry for 404 link not found in scrapy?

Hope that helps.

answered by alecxe