Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a Python library function which attempts to guess the character-encoding of some bytes? [duplicate]

I'm writing some mail-processing software in Python that is encountering strange bytes in header fields. I suspect this is just malformed mail; the message itself claims to be us-ascii, so I don't think there is a true encoding, but I'd like to get out a unicode string approximating the original one without throwing a UnicodeDecodeError.

So, I'm looking for a function that takes a str and optionally some hints and does its darndest to give me back a unicode. I could write one of course, but if such a function exists its author has probably thought a bit deeper about the best way to go about this.

I also know that Python's design prefers explicit to implicit and that the standard library is designed to avoid implicit magic in decoding text. I just want to explicitly say "go ahead and guess".

like image 952
Nick Avatar asked Nov 06 '08 15:11

Nick


People also ask

How do you determine character encoding?

One way to check this is to use the W3C Markup Validation Service. The validator usually detects the character encoding from the HTTP headers and information in the document. If the validator fails to detect the encoding, it can be selected on the validator result page via the 'Encoding' pulldown menu (example).

What is character encoding in Python give an example?

A character encoding is one specific way of interpreting bytes: It's a look-up table that says, for example, that a byte with the value 97 stands for 'a'. In Python 2 1, this is a str object: a series of bytes without any information about how they should be interpreted.

How do you get the UTF-8 character code in Python?

UTF-8 is a variable-length encoding, so I'll assume you really meant "Unicode code point". Use chr() to convert the character code to a character, decode it, and use ord() to get the code point. In Python 2, chr only supports ASCII, so only numbers in the [0.. 255] range.


1 Answers

+1 for the chardet module (suggested by @insin).

It is not in the standard library, but you can easily install it with the following command:

$ pip install chardet

Example:

>>> import chardet
>>> import urllib
>>> detect = lambda url: chardet.detect(urllib.urlopen(url).read())
>>> detect('http://stackoverflow.com')
{'confidence': 0.85663169917190185, 'encoding': 'ISO-8859-2'}    
>>> detect('https://stackoverflow.com/questions/269060/is-there-a-python-lib')
{'confidence': 0.98999999999999999, 'encoding': 'utf-8'}

See Installing Pip if you don't have one.

like image 129
jfs Avatar answered Sep 18 '22 19:09

jfs