What is a nice, reliable short way to get the charset of a webpage?

Question

I'm a bit surprised that it's so complicated to get the charset of a webpage with Python. Am I missing a way? The HTTPMessage has loads of methods, but not this one.

>>> google = urllib2.urlopen('http://www.google.com/')
>>> google.headers.gettype()
'text/html'
>>> google.headers.getencoding()
'7bit'
>>> google.headers.getcharset()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: HTTPMessage instance has no attribute 'getcharset'

So you have to get the header, and split it. Twice.

>>> google = urllib2.urlopen('http://www.google.com/')
>>> charset = 'ISO-8859-1'
>>> contenttype = google.headers.getheader('Content-Type', '')
>>> if ';' in contenttype:
...     charset = contenttype.split(';')[1].split('=')[1]
>>> charset
'ISO-8859-1'

That's a surprising number of steps for such a basic function. Am I missing something?

Leniel Maccaferri · Accepted Answer

Have you checked this?

How to download any(!) webpage with correct charset in python?

Elias Zamaria · Answer

I did some research and came up with this solution:

import urllib.request

response = urllib.request.urlopen(url)
encoding = response.headers.get_content_charset()

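This is how I would do it in Python 3. It does not carry over to Python 2: there is no urllib2.request module, and the headers object returned by urllib2.urlopen has no get_content_charset method. The closest Python 2 equivalent is urllib2.urlopen(url).headers.getparam('charset').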

Here is how it works, since the official Python documentation doesn't explain it very well. The result of urlopen is an http.client.HTTPResponse object. Its headers attribute is an http.client.HTTPMessage object, which, according to the documentation, "is implemented using the email.message.Message class". That class has a method called get_content_charset, which returns the charset parameter of the Content-Type header, converted to lower case.
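To see the mechanism without making a network request, here is a minimal sketch that builds a Message by hand. The header value is made up for illustration; the headers object returned by urlopen should behave the same way, since HTTPMessage is a subclass of Message.

```python
from email.message import Message
from http.client import HTTPMessage

# http.client.HTTPMessage inherits its header-parsing methods from Message.
assert issubclass(HTTPMessage, Message)

# Illustrative header value, not taken from a real response.
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"

content_type = headers.get_content_type()
charset = headers.get_content_charset()

print(content_type)  # media type only, without parameters
print(charset)       # charset parameter, lowercased by the method
```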

By default, this method returns None if the header has no charset parameter, but you can change that fallback value by passing a failobj parameter:

encoding = response.headers.get_content_charset(failobj="utf-8")
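Putting the two pieces together, a rough sketch of downloading and decoding a page might look like the following. The helper name fetch_text and its default_charset parameter are hypothetical, not part of urllib, and the data: URL is only there so the example runs without network access.

```python
import urllib.request


def fetch_text(url, default_charset="utf-8"):
    """Hypothetical helper: fetch url and decode the body using the
    charset from the Content-Type header, or default_charset if absent."""
    with urllib.request.urlopen(url) as response:
        charset = response.headers.get_content_charset(failobj=default_charset)
        # errors="replace" keeps going if the declared charset is wrong
        # for some bytes of the body.
        return response.read().decode(charset, errors="replace")


# urllib.request also handles data: URLs, which is convenient for an
# offline demonstration; a real call would pass an http(s) URL instead.
text = fetch_text("data:text/plain;charset=iso-8859-1,caf%E9")
print(text)
```

Note that a server can still declare a charset name that Python does not recognise, in which case decode raises LookupError; and the header says nothing about a charset declared only inside the HTML itself.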

Tags: python, content-type, http

Lennart Regebro
