 

Python Requests: requests.exceptions.TooManyRedirects: Exceeded 30 redirects


I was trying to crawl this page using the python-requests library:

import requests
from lxml import etree, html

url = 'http://www.amazon.in/b/ref=sa_menu_mobile_elec_all?ie=UTF8&node=976419031'
r = requests.get(url)
tree = etree.HTML(r.text)
print tree

but I got the above error (TooManyRedirects). I tried to use the allow_redirects parameter, but got the same error:

r = requests.get(url, allow_redirects=True)

I even tried to send headers and data along with the URL, but I'm not sure if this is the correct way to do it:

headers = {'content-type': 'text/html'}
payload = {'ie': 'UTF8', 'node': '976419031'}
r = requests.post(url, data=payload, headers=headers, allow_redirects=True)

How do I resolve this error? Out of curiosity I even tried BeautifulSoup4, and got a different but similar kind of error:

page = BeautifulSoup(urllib2.urlopen(url))

urllib2.HTTPError: HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop. The last 30x error message was: Moved Permanently 
user3628682 asked May 14 '14 10:05


2 Answers

Amazon is redirecting your request to http://www.amazon.in/b?ie=UTF8&node=976419031, which in turn redirects to http://www.amazon.in/electronics/b?ie=UTF8&node=976419031, after which you have entered a loop:

>>> loc = url
>>> seen = set()
>>> while True:
...     r = requests.get(loc, allow_redirects=False)
...     loc = r.headers['location']
...     if loc in seen: break
...     seen.add(loc)
...     print loc
... 
http://www.amazon.in/b?ie=UTF8&node=976419031
http://www.amazon.in/electronics/b?ie=UTF8&node=976419031
>>> loc
'http://www.amazon.in/b?ie=UTF8&node=976419031'

So your original URL A redirects to a new URL B, which redirects to C, which redirects back to B, and so on.
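The same loop check can be factored into a small helper. This is just a sketch: `follow` is a hypothetical callable mapping a URL to its Location header, so a plain dict's `.get` can stand in for the network during testing:

```python
def find_redirect_loop(start, follow, max_hops=30):
    """Follow Location headers until a URL repeats; return the repeated URL."""
    seen = set()
    loc = start
    for _ in range(max_hops):
        loc = follow(loc)
        if loc in seen:
            return loc  # first URL seen twice, i.e. the start of the loop
        seen.add(loc)
    return None  # no loop found within max_hops

# Simulated redirect chain A -> B -> C -> B, as in the Amazon example
chain = {'A': 'B', 'B': 'C', 'C': 'B'}
print(find_redirect_loop('A', chain.get))  # prints B
```

With `requests` the `follow` argument would be something like `lambda u: requests.get(u, allow_redirects=False).headers['location']`.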

Apparently Amazon does this based on the User-Agent header; it then sets a cookie that subsequent requests need to send back. The following works:

>>> s = requests.Session()
>>> s.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
>>> r = s.get(url)
>>> r
<Response [200]>

This creates a session (for ease of reuse and for cookie persistence) and sets a copy of the Chrome user agent string on it. The request then succeeds (returns a 200 response).
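You can see the same mechanism in miniature with a local toy server. This is only a sketch: the handler below is hypothetical and merely mimics the described behaviour of looping any request whose User-Agent doesn't look like a browser:

```python
import http.server
import threading

import requests

class UAHandler(http.server.BaseHTTPRequestHandler):
    """Toy server: browser-like user agents get a 200, everything else loops."""
    def do_GET(self):
        if 'Mozilla' in self.headers.get('User-Agent', ''):
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'ok')
        else:
            self.send_response(301)
            self.send_header('Location', self.path)  # redirect to itself
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet

server = http.server.HTTPServer(('127.0.0.1', 0), UAHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_address[1]

s = requests.Session()
s.headers['User-Agent'] = 'Mozilla/5.0 (toy browser)'  # any browser-like UA
r = s.get(url)
server.shutdown()
print(r.status_code)  # prints 200
```

With requests' default User-Agent (`python-requests/...`) the same `s.get(url)` call raises TooManyRedirects after 30 hops, exactly like the error in the question.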

Martijn Pieters answered Sep 19 '22 19:09


You can increase max_redirects by explicitly setting the count on a session, as in the example below:

session = requests.Session()
session.max_redirects = 60
session.get('http://www.amazon.com')
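Note that raising the limit only helps when the redirect chain is long but finite; a genuine A → B → A loop will exhaust any limit. A sketch of failing gracefully instead, using a hypothetical local server that loops forever as a stand-in for such a site:

```python
import http.server
import threading

import requests

class LoopHandler(http.server.BaseHTTPRequestHandler):
    """Toy server that redirects every request in an endless /a <-> /b loop."""
    def do_GET(self):
        self.send_response(301)
        self.send_header('Location', '/b' if self.path == '/a' else '/a')
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet

server = http.server.HTTPServer(('127.0.0.1', 0), LoopHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/a' % server.server_address[1]

session = requests.Session()
session.max_redirects = 5  # small limit so the demo fails fast
try:
    session.get(url)
    result = 'ok'
except requests.TooManyRedirects:
    result = 'loop detected'  # give up cleanly instead of crashing
server.shutdown()
print(result)  # prints loop detected
```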
PrabaKaran D answered Sep 18 '22 19:09