Python requests.get() returns improperly decoded text instead of UTF-8?

When the server sends the header 'Content-Type: text/html' (with no charset), the text returned by requests.get() is improperly decoded.

However, if the header explicitly says 'Content-Type: text/html; charset=utf-8', the text is decoded properly.

Also, when we use urllib.urlopen(), it returns properly encoded data.

Has anyone noticed this before? Why does requests.get() behave like this?

arunk2 asked May 26 '17 13:05



2 Answers

The "educated guesses" mentioned in the documentation quoted in the other answer are probably just a check of the Content-Type header sent by the server (a rather misleading use of "educated", IMHO).

For response header Content-Type: text/html the result is ISO-8859-1 (the default for text/* media types in HTTP/1.1 per RFC 2616, and for HTML4), regardless of any content analysis, even though the default for HTML5 is UTF-8.

For response header Content-Type: text/html; charset=utf-8 the result is UTF-8.
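To make the effect of the two headers concrete, here is a minimal standard-library sketch. The helper `charset_from_content_type` is hypothetical and only imitates the header rule described above; it is not the code requests itself runs.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Hypothetical helper: pick a charset from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if charset:
        return charset
    # Assumed fallback, imitating the rule described above:
    # text/* without an explicit charset is treated as ISO-8859-1.
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"
    return None


body = "café".encode("utf-8")  # bytes a UTF-8 page would actually send

wrong = body.decode(charset_from_content_type("text/html"))
right = body.decode(charset_from_content_type("text/html; charset=utf-8"))

print(repr(wrong))  # 'cafÃ©'
print(repr(right))  # 'café'
```

The bytes are identical in both cases; only the charset chosen from the header differs, which is why the same page can look garbled or fine.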

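Luckily for us, requests can also guess the encoding from the response body using the chardet library (charset_normalizer in newer releases), exposed as the attribute requests.Response.apparent_encoding. That usually works quite well, so you usually want to do: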

    import requests

    r = requests.get("https://martin.slouf.name/")
    # override encoding by real educated guess as provided by chardet
    r.encoding = r.apparent_encoding
    # access the data
    r.text
bubak answered Oct 06 '22 01:10


From requests documentation:

When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property.

    >>> r.encoding
    'utf-8'
    >>> r.encoding = 'ISO-8859-1'

Check the encoding Requests used for your page and, if it's not the right one, force it to the one you need.
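One way to apply that advice without overriding a charset the server did declare is sketched below. The helper name `use_detected_encoding_if_undeclared` is hypothetical; it relies only on the `headers`, `encoding` and `apparent_encoding` attributes that `requests.Response` exposes, and a stand-in object is used so the sketch runs without a network connection.

```python
from types import SimpleNamespace


def use_detected_encoding_if_undeclared(response):
    """Hypothetical helper: keep a declared charset, else use the detector's guess."""
    content_type = response.headers.get("Content-Type", "")
    if "charset=" not in content_type.lower():
        # No charset declared by the server: fall back to the body-based guess.
        response.encoding = response.apparent_encoding
    return response


# Stand-in with the same attribute names as requests.Response (an assumption
# made so the example is runnable offline).
fake = SimpleNamespace(
    headers={"Content-Type": "text/html"},
    encoding="ISO-8859-1",
    apparent_encoding="utf-8",
)
print(use_detected_encoding_if_undeclared(fake).encoding)  # utf-8
```

With the real library, the call would presumably look like `r = use_detected_encoding_if_undeclared(requests.get(url))` followed by `r.text`.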

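Regarding the difference between requests and urllib.urlopen: urllib.urlopen().read() returns the raw bytes exactly as the server sent them, without decoding, whereas r.text decodes those bytes using r.encoding. UTF-8 bytes printed to a UTF-8 terminal therefore look fine with urllib, while r.text may already have been decoded with the wrong charset. That's all.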
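For comparison, here is a rough sketch of decoding a urllib response explicitly. It uses Python 3's `urllib.request.urlopen` (the Python 3 counterpart of `urllib.urlopen`); the helper `fetch_text` and its UTF-8 fallback are assumptions for illustration, and a `data:` URL stands in for a real server so the example runs offline.

```python
from urllib.request import urlopen


def fetch_text(url, fallback="utf-8"):
    """Hypothetical helper: read raw bytes with urllib and decode them explicitly."""
    with urlopen(url) as resp:
        raw = resp.read()  # bytes, not yet decoded
        declared = resp.headers.get_content_charset()  # None if no charset given
    return raw.decode(declared or fallback)


# The header here is "text/html" with no charset, so the fallback is used.
print(fetch_text("data:text/html,caf%C3%A9"))  # café
```

Unlike `r.text`, nothing is decoded until `raw.decode(...)` is called, so the choice of fallback is entirely in the caller's hands.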

Dekel answered Oct 06 '22 02:10