I'm building a broken link checker in python, and it's becoming a chore building the logic for correctly identifying links that do not resolve when visited with a browser. I've found a set of links where I can consistently reproduce a redirect error with my scraper, but which resolve perfectly when visited in a browser. I was hoping I could find some insight here.
import urllib
import urllib.request
import html.parser
import requests
from requests.exceptions import HTTPError
from socket import error as SocketError
try:
req=urllib.request.Request(url, None, {'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36','Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8','Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3','Accept-Encoding': 'gzip, deflate, sdch','Accept-Language': 'en-US,en;q=0.8','Connection': 'keep-alive'})
response = urllib.request.urlopen(req)
raw_response = response.read().decode('utf8', errors='ignore')
response.close()
except urllib.request.HTTPError as inst:
output = format(inst)
print(output)
In this instance, an example of a URL that reliably returns this error is 'http://forums.hostgator.com/want-see-your-sites-dns-propagating-t48838.html'. It resolves perfectly when visited but the code above will return the following error:
HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently
Any ideas how I can correctly identify these links as functional without blindly ignoring links from that site (which might miss genuinely broken links)?
The data returned by urlopen() or urlretrieve() is the raw data returned by the server. This may be binary data (such as an image), plain text or (for example) HTML. The HTTP protocol provides type information in the reply header, which can be inspected by looking at the Content-Type header.
Urllib package is the URL handling module for python. It is used to fetch URLs (Uniform Resource Locators). It uses the urlopen function and is able to fetch URLs using a variety of different protocols. Urllib is a package that collects several modules for working with URLs, such as: urllib.
You get the infinite loop error because the page you want to scrape uses cookies and redirects when the cookie isn't sent by the client. You'll get the same error with most other scraper tools and also with browsers when you disallow cookies.
You need a http.cookiejar.CookieJar
and a urllib.request.HTTPCookieProcessor
to avoid the redirect loop:
import urllib
import urllib.request
import html.parser
import requests
from requests.exceptions import HTTPError
from socket import error as SocketError
from http.cookiejar import CookieJar
try:
req=urllib.request.Request(url, None, {'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36','Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8','Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3','Accept-Encoding': 'gzip, deflate, sdch','Accept-Language': 'en-US,en;q=0.8','Connection': 'keep-alive'})
cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
response = opener.open(req)
raw_response = response.read().decode('utf8', errors='ignore')
response.close()
except urllib.request.HTTPError as inst:
output = format(inst)
print(output)
I concur with the comments in the 1st answer and it wasn't working for me (I was getting some encoded/compressed byte data, nothing readable)
The link mentioned used urllib2. It also works with urllib in python 3.7 as follow:
from urllib.request import build_opener, HTTPCookieProcessor
opener = build_opener(HTTPCookieProcessor())
response = opener.open('http://www.bad.org.uk')
print response.read()
I tried the solutions above without success.
It appears that this problem can occur when the URL you are trying to open is badly formed (or just not what the REST service is expecting). For example, I found my problem was because I requested https://host.com/users/4484486
where the host was expecting a slash at the end: https://host.com/users/4484486/
solved the problem.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With