When the server's content type is 'Content-Type: text/html', requests.get() returns improperly encoded data. However, if the content type explicitly specifies a charset, as in 'Content-Type: text/html; charset=utf-8', it returns properly encoded data. Also, when we use urllib.urlopen(), it returns properly encoded data.

Has anyone noticed this before? Why does requests.get() behave like this?
UTF-8 is a byte-oriented encoding: each character is represented by a specific sequence of one to four bytes.

To decode data encoded as UTF-8, we can use the decode() method available on bytes objects. This method accepts two arguments, encoding and errors. encoding names the encoding of the bytes to be decoded, and errors decides how to handle errors that arise during decoding.

In Python 3, strings are Unicode by default, which means each character corresponds to a unique code point; an encoding such as UTF-8 only comes into play when converting between str and bytes.
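A minimal sketch of decode() with both arguments (standard library only; the byte string below is just an illustration):

```python
data = b"caf\xc3\xa9"  # the UTF-8 byte sequence for "café"

# Decode with the correct encoding:
print(data.decode("utf-8"))  # café

# Decode with a wrong encoding, but substitute the Unicode
# replacement character instead of raising UnicodeDecodeError:
print(data.decode("ascii", errors="replace"))  # caf followed by two replacement chars
```

With the default errors="strict", the second call would raise UnicodeDecodeError instead.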
response.text – Python requests. response.text returns the content of the response as a Unicode string, decoded from the raw bytes (the raw bytes themselves are available as response.content). Python requests is generally used to fetch the content from a particular resource URI; whenever we make a request to a specified URI through Python, it returns a response object.

Check the content at the start of the output; it shows the entire content as Unicode text. There are many libraries to make an HTTP request in Python (httplib, urllib, httplib2, treq, etc.), but requests is one of the best, with convenient features. If an attribute of the response is unexpectedly empty, check response.status_code first.
Educated guesses (mentioned above) are probably just a check of the Content-Type header as sent by the server (a quite misleading use of "educated", imho).

For the response header Content-Type: text/html, the result is ISO-8859-1 (the default for HTML4), regardless of any content analysis (the default for HTML5 would be UTF-8).
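This fallback is exactly what produces the "improperly encoded data" from the question. A local sketch (no network, no requests involved) of UTF-8 bytes decoded as ISO-8859-1:

```python
raw = "héllo".encode("utf-8")  # what the server actually sends

wrong = raw.decode("iso-8859-1")  # the HTML4 fallback -> mojibake
right = raw.decode("utf-8")       # the intended text

print(wrong)  # hÃ©llo
print(right)  # héllo
```

Each multi-byte UTF-8 sequence is mistakenly read as two separate Latin-1 characters, which is the classic mojibake symptom.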
For the response header Content-Type: text/html; charset=utf-8, the result is UTF-8.
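The header-based guess can be reproduced in a few lines. This is a simplified sketch of the behaviour described above, not requests' actual code, and the function name is made up:

```python
from email.message import Message

def encoding_from_content_type(content_type):
    """Guess an encoding from a Content-Type header value: use the
    charset parameter if present, else fall back to ISO-8859-1 for
    text/* types (the behaviour described above)."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if charset:
        return charset
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"
    return None

print(encoding_from_content_type("text/html"))                 # ISO-8859-1
print(encoding_from_content_type("text/html; charset=utf-8"))  # utf-8
```

email.message.Message is used here only as a convenient stdlib parser for the header's parameters.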
Luckily for us, requests uses the chardet library, and that usually works quite well (attribute requests.Response.apparent_encoding), so you usually want to do:

r = requests.get("https://martin.slouf.name/")
# override encoding by real educated guess as provided by chardet
r.encoding = r.apparent_encoding
# access the data
r.text
From requests documentation:
When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property.
>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'
Check the encoding requests used for your page, and if it's not the right one - try to force it to be the one you need.
Regarding the differences between requests and urllib.urlopen: they probably just use different ways to handle the encoding (note that urlopen().read() hands back raw bytes, so any decoding there is done by you). That's all.