urlopen Returning Redirect Error for Valid Links

I'm building a broken link checker in Python, and it's becoming a chore to build the logic for correctly identifying links that do not resolve when visited with a browser. I've found a set of links where my scraper consistently hits a redirect error, but which resolve perfectly when visited in a browser. I was hoping I could find some insight here.

import urllib.error
import urllib.request

# Example URL that reproduces the redirect error.
url = 'http://forums.hostgator.com/want-see-your-sites-dns-propagating-t48838.html'

# Browser-like request headers.
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
}

try:
    req = urllib.request.Request(url, None, headers)
    response = urllib.request.urlopen(req)
    raw_response = response.read().decode('utf8', errors='ignore')
    response.close()
except urllib.error.HTTPError as inst:
    # Print only when the request failed, so output is always defined here.
    output = format(inst)
    print(output)

In this instance, an example of a URL that reliably returns this error is 'http://forums.hostgator.com/want-see-your-sites-dns-propagating-t48838.html'. It resolves perfectly when visited but the code above will return the following error:

HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently

Any ideas how I can correctly identify these links as functional without blindly ignoring links from that site (which might miss genuinely broken links)?

asked Sep 14 '15 16:09

David Scott

3 Answers

You get the infinite loop error because the page you want to scrape uses cookies and redirects when the cookie isn't sent by the client. You'll get the same error with most other scraper tools and also with browsers when you disallow cookies.

You need an http.cookiejar.CookieJar and a urllib.request.HTTPCookieProcessor to avoid the redirect loop:

import urllib.error
import urllib.request
from http.cookiejar import CookieJar

# Example URL from the question.
url = 'http://forums.hostgator.com/want-see-your-sites-dns-propagating-t48838.html'

# Browser-like request headers. Note that urllib does not decompress
# gzip/deflate responses automatically, so sending Accept-Encoding may
# yield compressed bytes from some servers.
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
}

try:
    req = urllib.request.Request(url, None, headers)
    # The cookie jar stores cookies set during the redirect and sends them back.
    cj = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    response = opener.open(req)
    raw_response = response.read().decode('utf8', errors='ignore')
    response.close()
except urllib.error.HTTPError as inst:
    output = format(inst)
    print(output)
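
To see the mechanism without depending on the real site, here is a minimal sketch that reproduces the loop locally. The CookieGateHandler server and the check helper are hypothetical names made up for this illustration: the server redirects to itself until the client sends back a cookie, which is my assumption about what the forum does, not something confirmed from its responses. Proxies are disabled only so the request reaches the local server directly.

```python
import threading
import urllib.error
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class CookieGateHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: redirects to itself until a cookie is sent."""

    def do_GET(self):
        if "seen=1" in self.headers.get("Cookie", ""):
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            # No cookie yet: set one and redirect back to the same URL.
            self.send_response(301)
            self.send_header("Set-Cookie", "seen=1; Path=/")
            self.send_header("Location", self.path)
            self.send_header("Content-Length", "0")
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet


def check(url, opener):
    """Return the HTTP status on success, or the formatted HTTPError."""
    try:
        with opener.open(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return format(err)


# Port 0 lets the OS pick a free port for the local demo server.
server = HTTPServer(("127.0.0.1", 0), CookieGateHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/page" % server.server_port

# An empty ProxyHandler bypasses any proxy configured in the environment.
no_proxy = urllib.request.ProxyHandler({})
plain = urllib.request.build_opener(no_proxy)
with_cookies = urllib.request.build_opener(
    no_proxy, urllib.request.HTTPCookieProcessor(CookieJar())
)

plain_result = check(url, plain)           # redirect loop, reported as HTTP Error 301
cookie_result = check(url, with_cookies)   # cookie is returned, so the page loads

server.shutdown()
server.server_close()

print(plain_result)
print(cookie_result)
```

For a link checker, the practical takeaway is probably to build one cookie-aware opener and use it for every check, treating a link as broken only if it still fails with cookies enabled.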

answered Dec 09 '22 16:12

cg909

I concur with the comments on the first answer: it wasn't working for me (I was getting encoded/compressed byte data, nothing readable).

The link mentioned used urllib2. It also works with urllib in Python 3.7, as follows:

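from urllib.request import build_opener, HTTPCookieProcessor

# HTTPCookieProcessor() creates its own CookieJar when none is passed.
opener = build_opener(HTTPCookieProcessor())
response = opener.open('http://www.bad.org.uk')
print(response.read())  # print is a function in Python 3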
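
On the compressed byte data: urllib does not decompress responses on its own, so if the request advertises gzip in Accept-Encoding the body may come back compressed. A minimal sketch of one way to handle that, assuming the server labels the body with a Content-Encoding header; decode_body is a hypothetical helper name, not part of urllib:

```python
import gzip
import zlib


def decode_body(raw, content_encoding="", charset="utf-8"):
    """Decompress a response body according to Content-Encoding, then decode it."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        raw = gzip.decompress(raw)
    elif encoding == "deflate":
        # Assumes the zlib-wrapped format; some servers send raw deflate
        # instead, which this sketch does not handle.
        raw = zlib.decompress(raw)
    return raw.decode(charset, errors="ignore")


# Simulate a gzip-compressed body instead of making a network call.
sample = gzip.compress("<html>ok</html>".encode("utf-8"))
text = decode_body(sample, "gzip")
print(text)
```

With a real response this would be called as decode_body(response.read(), response.headers.get("Content-Encoding")). The simpler alternative is to leave Accept-Encoding out of the request headers entirely.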

answered Dec 09 '22 17:12

MrE

I tried the solutions above without success.

It appears that this problem can occur when the URL you are trying to open is badly formed (or just not what the REST service is expecting). For example, I found my problem was because I requested https://host.com/users/4484486 where the host was expecting a slash at the end: https://host.com/users/4484486/ solved the problem.
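
If a link checker wanted to account for this, one option is to retry the slash-terminated variant before reporting the link as broken. A minimal sketch, where with_trailing_slash is a hypothetical helper and the retry policy is my assumption, not something the service requires:

```python
from urllib.parse import urlsplit, urlunsplit


def with_trailing_slash(url):
    """Return url with a '/' appended to its path if it lacks one."""
    parts = urlsplit(url)
    if parts.path.endswith("/"):
        return url
    # Only the path changes; query string and fragment are preserved.
    return urlunsplit(parts._replace(path=parts.path + "/"))


candidate = with_trailing_slash("https://host.com/users/4484486")
print(candidate)
```

This only makes sense for directory-style paths; for a URL that ends in a file name such as page.html, adding a slash would most likely produce a different, invalid address.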

answered Dec 09 '22 17:12

Peter


Tags: python-3.x, httprequest, urllib
