
python urllib.request - headers that are likely to work

Working on a little script to fetch info from websites. I'm having trouble with HTTP errors.

import urllib.request

# lnk comes from earlier in the script (not shown here)
req = urllib.request.Request(lnk['href'],
   headers={'User-Agent': 'Mozilla/5.0', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'})
page = urllib.request.urlopen(req)

When this tries to fetch, for example, http://www.guru99.com/node-js-tutorial.html, I get a long traceback ending with 406 Not Acceptable:

Traceback (most recent call last):
  File "get_links.py", line 45, in <module>
    page = urllib.request.urlopen(req)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 162, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 471, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 581, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 509, in error
    return self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 443, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 589, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 406: Not Acceptable

Googling around I have found that I should fix the headers (as I have done above) and lots of tutorials about how to fix the headers. Except - not much actually works.

Is there some set of good headers which are likely to not cause a problem with most sites? Is there some python module someone else has created that already includes commonly-working headers? Is there a good way to retry several times with different headers until you get a good response?

This seems like a problem everybody who does web scraping with Python deals with, and I haven't found a decent solution.

Adam Michael Wood asked Oct 12 '25 02:10


2 Answers

The following set of headers seems to work for most of the sites I tested. If anyone else has suggestions, please offer them. I'm also interested in good solutions for trying different headers if one set doesn't work.

req = urllib.request.Request(lnk['href'],
   headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})
page = urllib.request.urlopen(req)
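
One possible way to try different headers when one set fails is sketched below. This is a rough illustration, not a tested scraping solution: HEADER_SETS, fetch_with_fallback and the opener parameter are hypothetical names introduced here, and the two header sets are simply the ones that appear in this thread.

```python
import urllib.error
import urllib.request

# Hypothetical candidate header sets, tried in order (values taken from this thread).
HEADER_SETS = (
    {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36',
    },
    {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    },
)


def fetch_with_fallback(url, header_sets=HEADER_SETS, opener=urllib.request.urlopen):
    """Try each header set in turn and return the first successful response."""
    last_error = None
    for headers in header_sets:
        req = urllib.request.Request(url, headers=headers)
        try:
            return opener(req)
        except urllib.error.HTTPError as err:
            # Remember the failure (for example 406 or 403) and try the next set.
            last_error = err
    if last_error is None:
        raise ValueError('header_sets must not be empty')
    # Every header set failed: re-raise the most recent HTTP error.
    raise last_error
```

In the script above this would replace the two-step Request/urlopen call, for example page = fetch_with_fallback(lnk['href']). The opener argument exists only so the function can be exercised without network access.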
Adam Michael Wood answered Oct 14 '25 17:10


HTTP Error 406 Not Acceptable

The HyperText Transfer Protocol (HTTP) 406 Not Acceptable client error response code indicates that the server cannot produce a response matching the list of acceptable values defined in the request's proactive content negotiation headers, and that the server is unwilling to supply a default representation.
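
To see the status code and reason without the full traceback, the HTTPError can be caught directly. A minimal sketch, assuming a hypothetical helper name try_fetch and an opener parameter that is only there to make the function easy to test offline:

```python
import urllib.error
import urllib.request


def try_fetch(req, opener=urllib.request.urlopen):
    """Return (status, body) on success or (code, reason) on an HTTP error."""
    try:
        with opener(req) as page:
            return page.status, page.read()
    except urllib.error.HTTPError as err:
        # err.code is the numeric status (e.g. 406); err.reason is the message text.
        return err.code, err.reason
```

With the request from the question, a 406 response would come back as a tuple whose first element is 406, which a script can check before deciding whether to retry with other headers.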

The issue appears to be with your User-Agent header: the bare value Mozilla/5.0 is not a complete browser User-Agent string. Here are some links that list full, valid User-Agent strings:

  • deviceatlas.com
  • developer.chrome.com
  • developer.mozilla.org

So change your code to the following:

req = urllib.request.Request(lnk['href'],
   headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'})
page = urllib.request.urlopen(req)
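
To double-check what will be sent before touching the network, the request object can be inspected. A small sketch, assuming the same header values as above; build_request is a hypothetical helper name, and the URL is the one from the question.

```python
import urllib.request


def build_request(url):
    """Build a Request carrying a full browser-style User-Agent and Accept header."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    }
    return urllib.request.Request(url, headers=headers)


req = build_request('http://www.guru99.com/node-js-tutorial.html')
# header_items() lists the headers attached to the request; no connection is made here.
for name, value in req.header_items():
    print(name + ': ' + value)
```

Note that urllib stores header names in capitalized form, so 'User-Agent' is listed as 'User-agent'. HTTP header names are case-insensitive, so this does not change what the server sees.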

I know this answer is late, but I hope it helps someone else.

Hasitha Jayawardana answered Oct 14 '25 17:10


