I have a string, say s = 'Chocolate Moelleux-M\xe8re'
When I call unicode(s), I get an error:
In [14]: unicode(s)
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 20: ordinal not in range(128)
Similarly, when I try to decode it with s.decode(), I get the same error:
In [13]: s.decode()
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 20: ordinal not in range(128)
How can I decode such a string into unicode?
To identify the non-Unicode characters in a file, drag and drop the file into Google Chrome or Mozilla Firefox. Chrome shows only the row and column number.
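The same row and column information can also be obtained without a browser. The following is only a minimal sketch: find_non_ascii is a hypothetical helper name, and it assumes that "non-Unicode characters" means bytes outside the ASCII range in the raw data.

```python
def find_non_ascii(data):
    """Return (row, column, byte_value) for every non-ASCII byte in data.

    data is a byte string; rows and columns are 1-based.
    """
    hits = []
    for row, line in enumerate(data.split(b"\n"), 1):
        # bytearray yields integers on both Python 2 and Python 3.
        for col, byte in enumerate(bytearray(line), 1):
            if byte > 0x7F:
                hits.append((row, col, byte))
    return hits


# Sample data built from the string in the question (an assumption for the demo).
sample = b"Chocolate Moelleux-M\xe8re\nplain ascii line\n"
print(find_non_ascii(sample))
```

For a real file, the bytes could be read with open(path, 'rb').read() and passed to the same helper.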
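The encode() method encodes the string using the specified encoding. If no encoding is specified, Python 3 uses UTF-8; Python 2, as used in the question, falls back to the ASCII codec, which is why unicode(s) and s.decode() fail on the byte 0xe8.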
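As an illustration, naming the codec explicitly avoids the ASCII fallback. This sketch assumes the bytes in the question are ISO-8859-1 (Latin-1), where 0xe8 is "è"; if the data actually came from another encoding, a different codec name would be needed.

```python
# The byte string from the question (the b prefix works on Python 2.6+ and 3).
s = b'Chocolate Moelleux-M\xe8re'

# Decoding as ASCII fails, which reproduces the error from the question.
try:
    s.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Assumption: the source encoding is ISO-8859-1.
text = s.decode('iso-8859-1')

# Once decoded, the text can be re-encoded in any target encoding, e.g. UTF-8.
utf8_bytes = text.encode('utf-8')
print(repr(utf8_bytes))
```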
I have faced this problem one too many times. My data contained strings in several different encodings, so I wrote a function that decodes a string heuristically, based on certain features of the different encodings.
import re
import sys

def decode_heuristically(string, enc=None, denc=sys.getdefaultencoding()):
    """
    Try to interpret 'string' using several possible encodings.
    @input : string, encode type.
    @output: a tuple (decoded_string, flag_decoded, encoding)
    """
    if isinstance(string, unicode):
        return string, 0, "utf-8"
    try:
        new_string = unicode(string, "ascii")
        return new_string, 0, "ascii"
    except UnicodeError:
        encodings = ["utf-8", "iso-8859-1", "cp1252", "iso-8859-15"]
        if denc != "ascii":
            encodings.insert(0, denc)
        if enc:
            encodings.insert(0, enc)
        for enc in encodings:
            # Skip the ISO-8859 codecs when C1 control bytes are present.
            if (enc in ("iso-8859-15", "iso-8859-1") and
                    re.search(r"[\x80-\x9f]", string) is not None):
                continue
            # Skip Latin-1/cp1252 when bytes that ISO-8859-15 redefines appear.
            if (enc in ("iso-8859-1", "cp1252") and
                    re.search(r"[\xa4\xa6\xa8\xb4\xb8\xbc-\xbe]", string)
                    is not None):
                continue
            try:
                new_string = unicode(string, enc)
            except UnicodeError:
                pass
            else:
                # Accept only a lossless round trip.
                if new_string.encode(enc) == string:
                    return new_string, 0, enc
        # If unable to decode, force decoding, i.e. ignore undecodable chars.
        output = [(unicode(string, enc, "ignore"), enc) for enc in encodings]
        output = [(len(new_string[0]), new_string) for new_string in output]
        output.sort()
        new_string, enc = output[-1][1]
        return new_string, 1, enc
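For readers who only need the core idea, here is a condensed sketch of the same try-each-encoding approach, applied to the string from the question. It is not the function above: guess_decode and its default candidate list are hypothetical, it omits the byte-range checks, and it is written to run on byte strings in both Python 2 and Python 3.

```python
def guess_decode(data, candidates=("utf-8", "cp1252", "iso-8859-1")):
    """Return (text, encoding) for the first candidate that decodes data strictly."""
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # No candidate decoded strictly: decode lossily with the first one.
    return data.decode(candidates[0], "replace"), candidates[0]


text, used = guess_decode(b'Chocolate Moelleux-M\xe8re')
print(used)
```

The order of the candidates matters: UTF-8 goes first because it rejects most non-UTF-8 input, while ISO-8859-1 accepts any byte sequence and therefore only makes sense as the last resort.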
To add to this, the following link gives good background on why encoding matters: Why we need sys.setdefaultencoding in py script