When the server's content type is 'Content-Type: text/html', requests.get() returns improperly encoded data. However, if the content type explicitly specifies a charset, as in 'Content-Type: text/html; charset=utf-8', it returns properly encoded data. Also, when we use urllib.urlopen(), it returns properly encoded data.

Has anyone noticed this before? Why does requests.get() behave like this?
UTF-8 is a byte-oriented encoding: each character is represented by a specific sequence of one to four bytes.

To decode data encoded as UTF-8, we can use the decode() method available on bytes objects. This method accepts two arguments, encoding and errors. encoding names the encoding of the bytes to be decoded, and errors decides how to handle errors that arise during decoding.

In Python 3, strings are Unicode by default, which means each character corresponds to a unique code point; an encoding such as UTF-8 only comes into play when converting between str and bytes.
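A minimal sketch of decode() with both arguments (standard library only; the byte string below is just an illustration):

```python
data = b"caf\xc3\xa9"  # the UTF-8 byte sequence for "café"

# Decode with the correct encoding:
print(data.decode("utf-8"))  # café

# Decode with a wrong encoding, but substitute the Unicode
# replacement character instead of raising UnicodeDecodeError:
print(data.decode("ascii", errors="replace"))  # caf followed by two replacement chars
```

With the default errors="strict", the second call would raise UnicodeDecodeError instead.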
response.text – Python requests. response.text returns the content of the response as a Unicode string, decoded from the raw bytes (the raw bytes themselves are available as response.content). Python requests is generally used to fetch the content from a particular resource URI; whenever we make a request to a specified URI through Python, it returns a response object.

Check the content at the start of the output; it shows the entire content as Unicode text. There are many libraries to make an HTTP request in Python (httplib, urllib, httplib2, treq, etc.), but requests is one of the best, with convenient features. If an attribute of the response is unexpectedly empty, check response.status_code first.
Educated guesses (mentioned above) are probably just a check of the Content-Type header as sent by the server (a quite misleading use of "educated", imho).

For the response header Content-Type: text/html, the result is ISO-8859-1 (the default for HTML4), regardless of any content analysis (the default for HTML5 would be UTF-8).
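This fallback is exactly what produces the "improperly encoded data" from the question. A local sketch (no network, no requests involved) of UTF-8 bytes decoded as ISO-8859-1:

```python
raw = "héllo".encode("utf-8")  # what the server actually sends

wrong = raw.decode("iso-8859-1")  # the HTML4 fallback -> mojibake
right = raw.decode("utf-8")       # the intended text

print(wrong)  # hÃ©llo
print(right)  # héllo
```

Each multi-byte UTF-8 sequence is mistakenly read as two separate Latin-1 characters, which is the classic mojibake symptom.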
For the response header Content-Type: text/html; charset=utf-8, the result is UTF-8.
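The header-based guess can be reproduced in a few lines. This is a simplified sketch of the behaviour described above, not requests' actual code, and the function name is made up:

```python
from email.message import Message

def encoding_from_content_type(content_type):
    """Guess an encoding from a Content-Type header value: use the
    charset parameter if present, else fall back to ISO-8859-1 for
    text/* types (the behaviour described above)."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if charset:
        return charset
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"
    return None

print(encoding_from_content_type("text/html"))                 # ISO-8859-1
print(encoding_from_content_type("text/html; charset=utf-8"))  # utf-8
```

email.message.Message is used here only as a convenient stdlib parser for the header's parameters.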
Luckily for us, requests uses the chardet library, and that usually works quite well (attribute requests.Response.apparent_encoding), so you usually want to do:

r = requests.get("https://martin.slouf.name/")
# override encoding by real educated guess as provided by chardet
r.encoding = r.apparent_encoding
# access the data
r.text
From requests documentation:
When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property.
>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'
Check the encoding requests used for your page, and if it's not the right one - try to force it to be the one you need.
Regarding the differences between requests and urllib.urlopen: they probably just use different ways to handle the encoding (note that urlopen().read() hands back raw bytes, so any decoding there is done by you). That's all.