
Requests.get in Python using "User-Agent" not simulating a browser request

I have to collect information from web pages using Python from a Linux terminal. It works wonderfully, but some pages (not all of them) return invalid URLs when I use requests.get, because they have user-agent detectors and don't know how to answer my request (from a Linux terminal I am neither a browser nor a mobile application).

Sending a "User-Agent" header didn't work either. I tried several different values to emulate a Mozilla browser:

user_agent = {'User-Agent': 'Mozilla/5.0'}

or

user_agent = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; hu-HU; rv:1.7.8) Gecko/20050511 Firefox/1.0.4'}

or many other combinations.

On some servers, when I use this line:

page = requests.get(url, headers=user_agent)

I get a bad request, because these servers try to send me a page for desktop or mobile browsers and fail to identify which kind of client I am.

Am I doing something wrong by sending a User-Agent this way? I tried my code in a Python Notebook and it works perfectly, because there I am (of course) sending the request from a browser.

Maximiliano Rios asked May 26 '14 21:05



1 Answer

You are using a very old user agent, and some sites will indeed block you because of it. A more recent one works:

>>> import requests
>>> header = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:32.0) Gecko/20100101 Firefox/32.0',}
>>> url = 'http://www.w3.org/'
>>> r = requests.get(url, headers=header)
>>> r.headers
CaseInsensitiveDict({'content-length': '40737', 'content-location': 'Home.html', 'accept-ranges': 'bytes', 'expires': 'Tue, 24 Jun 2014 04:44:36 GMT', 'vary': 'negotiate,accept', 'server': 'Apache/2', 'tcn': 'choice', 'last-modified': 'Mon, 23 Jun 2014 11:15:15 GMT', 'etag': '"9f21-4fc7ef51956c0;89-3f26bd17a2f00"', 'cache-control': 'max-age=600', 'date': 'Tue, 24 Jun 2014 04:34:36 GMT', 'p3p': 'policyref="http://www.w3.org/2001/05/P3P/p3p.xml"', 'content-type': 'text/html; charset=utf-8'})
>>> r.request.headers
CaseInsensitiveDict({'Accept-Encoding': 'gzip, deflate, compress', 'Accept': '*/*', 'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:32.0) Gecko/20100101 Firefox/32.0'})
>>> 
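To check what a server will actually see before any request is sent, the headers can be inspected on a prepared request. The following is only a minimal sketch: the `BROWSER_HEADERS` dict and the `build_request` helper are hypothetical names, and the extra `Accept` and `Accept-Language` values are illustrative assumptions about what a browser might send, not a guaranteed way past any particular site's detection.

```python
import requests

# Hypothetical browser-like header set. The User-Agent is the one used in the
# session above; the other values are illustrative assumptions.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:32.0) Gecko/20100101 Firefox/32.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_request(url, headers=BROWSER_HEADERS):
    """Prepare (but do not send) a GET request so its headers can be inspected."""
    session = requests.Session()
    # Session headers are merged into every request; these override the
    # library's default 'python-requests/...' User-Agent.
    session.headers.update(headers)
    return session.prepare_request(requests.Request("GET", url))


prepared = build_request("http://www.w3.org/")
for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

Nothing goes over the network here. If the printed User-Agent is the expected one and a site still rejects the request, the header itself is probably not the problem.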
karlcow answered Oct 01 '22 17:10