python urllib.request - headers that are likely to work

Question

Working on a little script to fetch info from websites. I'm having trouble with HTTP errors.

import urllib.request

# lnk is defined earlier in the script (not shown here)
req = urllib.request.Request(lnk['href'],
    headers={'User-Agent': 'Mozilla/5.0', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'})
page = urllib.request.urlopen(req)

When this tries to fetch, for example, http://www.guru99.com/node-js-tutorial.html, I get a long traceback ending with 406 Not Acceptable:

Traceback (most recent call last):
  File "get_links.py", line 45, in <module>
    page = urllib.request.urlopen(req)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 162, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 471, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 581, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 509, in error
    return self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 443, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 589, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 406: Not Acceptable

Googling around, I have found that I should fix the headers (as I have done above), along with lots of tutorials on how to do so. But not much actually works.

Is there some set of good headers that is unlikely to cause a problem with most sites? Is there a Python module someone else has created that already includes commonly working headers? Is there a good way to retry several times with different headers until you get a good response?

This seems like a problem everybody who does web scraping with Python deals with, and I haven't found a decent solution.

Adam Michael Wood · Accepted Answer

The following set of headers seems to work for most of the sites I tested. If anyone else has suggestions, please offer them. I'm also interested in good solutions for trying different headers if one set doesn't work.

req = urllib.request.Request(lnk['href'],
   headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})
page = urllib.request.urlopen(req)
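
One possible way to try several header sets in turn is sketched below. It is only a sketch under stated assumptions, not a tested recipe: the names fetch_with_fallback and HEADER_SETS, the opener parameter, and the choice to retry only on 403 and 406 are all made up for illustration. The User-Agent strings are the ones quoted in this thread.

```python
import urllib.error
import urllib.request

# Hypothetical list of candidate header sets, tried in order.
HEADER_SETS = [
    {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/35.0.1916.47 Safari/537.36'},
    {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/51.0.2704.103 Safari/537.36',
     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'},
]


def fetch_with_fallback(url, header_sets=HEADER_SETS,
                        opener=urllib.request.urlopen, timeout=10):
    """Try each header set in turn and return the first successful response."""
    last_error = None
    for headers in header_sets:
        req = urllib.request.Request(url, headers=headers)
        try:
            return opener(req, timeout=timeout)
        except urllib.error.HTTPError as err:
            # Assumption: only 403 and 406 are worth retrying with other
            # headers; any other HTTP error is re-raised immediately.
            if err.code not in (403, 406):
                raise
            last_error = err
    if last_error is None:
        raise ValueError('header_sets must not be empty')
    # Every header set was rejected, so surface the last error.
    raise last_error
```

With this in place, page = fetch_with_fallback(lnk['href']) would stand in for the urlopen call above. The opener argument exists only so the function can be exercised without touching the network.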

Hasitha Jayawardana · Answer

HTTP Error 406 Not Acceptable

The HyperText Transfer Protocol (HTTP) 406 Not Acceptable client error response code indicates that the server cannot produce a response matching the list of acceptable values defined in the request's proactive content negotiation headers, and that the server is unwilling to supply a default representation.
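
To see which status a given set of headers produces without the script dying on a traceback, the HTTPError can be caught and inspected. A minimal sketch, assuming a helper named try_fetch (hypothetical, as is its opener parameter, which is there only to allow testing without a network connection):

```python
import urllib.error
import urllib.request


def try_fetch(url, headers, opener=urllib.request.urlopen):
    """Return (status, body) on success or (status, reason) on an HTTP error."""
    req = urllib.request.Request(url, headers=headers)
    try:
        with opener(req) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # err.code is the numeric status (406 in the question),
        # err.reason is the accompanying text ("Not Acceptable").
        return err.code, err.reason
```

Calling try_fetch with the headers from the question and then with a different set makes it easy to compare what the server returns for each.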

So, as far as I can see, the issue is with your User-Agent header, both the key and its value (Mozilla/5.0). Lists of correct User-Agent strings are available at these sites:

deviceatlas.com
developer.chrome.com
developer.mozilla.org

So change your code to the following:

req = urllib.request.Request(lnk['href'],
    headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'})

I know this answer comes late, but I hope it helps someone else.

Tags: python, http-headers, urllib, web-scraping
