I'm writing some mail-processing software in Python that is encountering strange bytes in header fields. I suspect this is just malformed mail; the message itself claims to be us-ascii, so I don't think there is a true encoding, but I'd like to get out a unicode string approximating the original one without throwing a UnicodeDecodeError
.
So, I'm looking for a function that takes a str
and optionally some hints and does its darndest to give me back a unicode
. I could write one of course, but if such a function exists its author has probably thought a bit deeper about the best way to go about this.
I also know that Python's design prefers explicit to implicit and that the standard library is designed to avoid implicit magic in decoding text. I just want to explicitly say "go ahead and guess".
One way to check this is to use the W3C Markup Validation Service. The validator usually detects the character encoding from the HTTP headers and information in the document. If the validator fails to detect the encoding, it can be selected on the validator result page via the 'Encoding' pulldown menu (example).
A character encoding is one specific way of interpreting bytes: It's a look-up table that says, for example, that a byte with the value 97 stands for 'a'. In Python 2 1, this is a str object: a series of bytes without any information about how they should be interpreted.
UTF-8 is a variable-length encoding, so I'll assume you really meant "Unicode code point". Use chr() to convert the character code to a character, decode it, and use ord() to get the code point. In Python 2, chr only supports ASCII, so only numbers in the [0.. 255] range.
+1 for the chardet module (suggested by @insin
).
It is not in the standard library, but you can easily install it with the following command:
$ pip install chardet
Example:
>>> import chardet
>>> import urllib
>>> detect = lambda url: chardet.detect(urllib.urlopen(url).read())
>>> detect('http://stackoverflow.com')
{'confidence': 0.85663169917190185, 'encoding': 'ISO-8859-2'}
>>> detect('https://stackoverflow.com/questions/269060/is-there-a-python-lib')
{'confidence': 0.98999999999999999, 'encoding': 'utf-8'}
See Installing Pip if you don't have one.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With