
How to handle a 302 redirect in Scrapy

I am receiving a 302 response from a server while scraping a website:

2014-04-01 21:31:51+0200 [ahrefs-h] DEBUG: Redirecting (302) to <GET http://www.domain.com/Site_Abuse/DeadEnd.htm> from <GET http://domain.com/wps/showmodel.asp?Type=15&make=damc&a=664&b=51&c=0>

I want to send the requests to the original GET URLs instead of being redirected. I found this middleware:

https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/downloadermiddleware/redirect.py#L31

I copied this redirect code into my middleware.py file and added the following to settings.py:

DOWNLOADER_MIDDLEWARES = {
    'street.middlewares.RandomUserAgentMiddleware': 400,
    'street.middlewares.RedirectMiddleware': 100,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}

But I am still getting redirected. Is that all I have to do to get this middleware working? Am I missing something?

mrki asked Apr 01 '14 19:04



3 Answers

Forget about middlewares in this scenario; this will do the trick:

meta = {'dont_redirect': True, 'handle_httpstatus_list': [302]}

You need to pass this meta parameter when you yield your request:

# Inside a spider callback (needs: from scrapy import Request)
yield Request(
    item['link'],
    meta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
    callback=self.your_callback,
)
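
With the redirect left unfollowed, the callback receives the 302 response itself, so it can inspect where the server wanted to send you. The following is a minimal sketch, not part of the original answer: redirect_target is a hypothetical helper name, and it only relies on the Location header and the standard library. Scrapy's response.headers.get() returns bytes, which is why the helper accepts bytes as well as str.

```python
from urllib.parse import urljoin


def redirect_target(status, location, base_url):
    """Return the absolute URL a redirect response points to, or None.

    `location` is the raw Location header value: bytes (as Scrapy's
    response.headers.get() returns it), str, or None if it is missing.
    """
    if status not in (301, 302, 303, 307, 308) or not location:
        return None
    if isinstance(location, bytes):
        location = location.decode('latin-1')
    # Location may be relative, so resolve it against the request URL.
    return urljoin(base_url, location)


# Hypothetical use inside the spider callback:
#     target = redirect_target(response.status,
#                              response.headers.get('Location'),
#                              response.url)
#     if target and 'DeadEnd' in target:
#         self.logger.warning('Blocked: %s -> %s', response.url, target)
```

A check like the commented one would let you tell the "DeadEnd" redirect from the log above apart from ordinary redirects you may still want to follow manually.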

mrki answered Sep 19 '22 11:09


An inexplicable 302 response, such as a redirect from a page that loads fine in a web browser to the home page or some fixed page, usually indicates a server-side measure against undesired activity.

You must either reduce your crawl rate, or use a smart proxy (e.g. Crawlera) or a proxy-rotation service and retry your requests when you get such a response.
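
Reducing the crawl rate is a matter of project settings. A minimal sketch for settings.py, using standard Scrapy settings; the values are illustrative assumptions and need tuning for the site in question:

```python
# settings.py (values below are assumptions, not recommendations)

# Wait this many seconds between requests to the same site.
DOWNLOAD_DELAY = 2

# Send at most one request at a time to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let the AutoThrottle extension adapt the delay to server latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
```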

To retry such a response, add 'handle_httpstatus_list': [302] to the meta of the source request, and check if response.status == 302 in the callback. If it is, retry your request by yielding response.request.replace(dont_filter=True).

When retrying, you should also make your code limit the maximum number of retries of any given URL. You could keep a dictionary to track retries:

from scrapy import Request, Spider


class MySpider(Spider):
    name = 'my_spider'

    # Maximum number of retries per URL
    max_retries = 2

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Maps each URL to the number of times it has been retried
        self.retries = {}

    def start_requests(self):
        yield Request(
            'https://example.com',
            callback=self.parse,
            meta={
                # Let 302 responses reach the callback
                'handle_httpstatus_list': [302],
            },
        )

    def parse(self, response):
        if response.status == 302:
            retries = self.retries.setdefault(response.url, 0)
            if retries < self.max_retries:
                self.retries[response.url] += 1
                # dont_filter=True keeps the duplicate filter from dropping the retry
                yield response.request.replace(dont_filter=True)
            else:
                self.logger.error('%s still returns 302 responses after %s retries',
                                  response.url, retries)
            return
        # Process successful responses here

Depending on the scenario, you might want to move this code to a downloader middleware.
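
A rough sketch of what such a downloader middleware could look like follows. It is an illustration under assumptions, not a tested drop-in: the class name Retry302Middleware and the meta key 'retries_302' are hypothetical, and it only works if the middleware sees the 302 before the built-in RedirectMiddleware does (or if redirects are disabled). The retry count travels in request.meta instead of a dictionary on the spider.

```python
class Retry302Middleware:
    """Downloader middleware sketch: retry requests answered with a 302."""

    max_retries = 2           # assumed limit, mirrors the spider example
    meta_key = 'retries_302'  # hypothetical meta key, not a Scrapy built-in

    def process_response(self, request, response, spider):
        if response.status != 302:
            # Anything that is not a 302 passes through unchanged.
            return response

        retries = request.meta.get(self.meta_key, 0)
        if retries >= self.max_retries:
            spider.logger.error(
                '%s still returns 302 responses after %s retries',
                request.url, retries)
            # Give up and hand the 302 on to the remaining middlewares.
            return response

        # Returning a request from process_response reschedules it;
        # dont_filter=True keeps the duplicate filter from dropping it.
        retry = request.replace(dont_filter=True)
        retry.meta[self.meta_key] = retries + 1
        return retry
```

To try it, the class would be registered under DOWNLOADER_MIDDLEWARES in settings.py with an order value of your choosing; which value makes it run ahead of RedirectMiddleware is worth checking against the Scrapy documentation for your version.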

Gallaecio answered Sep 23 '22 11:09


You can disable the RedirectMiddleware by setting REDIRECT_ENABLED to False in settings.py.
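
In settings.py that is a single line. The second setting in this sketch is an optional assumption about what you want next: with redirects disabled, 302 responses are normally filtered out before they reach your callbacks, so you would need to allow them if you want to handle them yourself.

```python
# settings.py

# Turn off the built-in RedirectMiddleware for the whole project.
REDIRECT_ENABLED = False

# Optional: let 302 responses through to spider callbacks.
HTTPERROR_ALLOWED_CODES = [302]
```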

Steven Almeroth answered Sep 22 '22 11:09