
How to decode a non-Unicode character in Python?

Tags: python, unicode

I have a string, say s = 'Chocolate Moelleux-M\xe8re'. When I try to convert it with unicode(s), I get:

In [14]: unicode(s)
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 20: ordinal not in range(128)

Similarly, when I try to decode it with s.decode(), it raises the same error:

In [13]: s.decode()
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 20: ordinal not in range(128)

How can I decode such a string into unicode?

asked Oct 06 '10 06:10 by user12345



1 Answer

I have run into this problem many times. In my case the input contained strings in several different encodings, so I wrote a function that decodes a string heuristically, based on byte ranges that distinguish the candidate encodings.

import re
import sys

def decode_heuristically(string, enc=None, denc=sys.getdefaultencoding()):
    """
    Try to interpret 'string' using several possible encodings.
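    @input : string, an optional encoding to try first (enc), and the
             default encoding (denc).
    @output: a tuple (decoded_string, flag_decoded, encoding), where
             flag_decoded is 0 for a clean decode and 1 for a forced one.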
    """
    if isinstance(string, unicode): return string, 0, "utf-8"
    try:
        # Pure ASCII input decodes cleanly; return the unicode result.
        new_string = unicode(string, "ascii")
        return new_string, 0, "ascii"
    except UnicodeError:
        encodings = ["utf-8","iso-8859-1","cp1252","iso-8859-15"]

        if denc != "ascii": encodings.insert(0, denc)

        if enc: encodings.insert(0, enc)

        for enc in encodings:
            if (enc in ("iso-8859-15", "iso-8859-1") and
                re.search(r"[\x80-\x9f]", string) is not None):
                continue

            if (enc in ("iso-8859-1", "cp1252") and
                re.search(r"[\xa4\xa6\xa8\xb4\xb8\xbc-\xbe]", string)\
                is not None):
                continue

            try:
                new_string = unicode(string, enc)
            except UnicodeError:
                pass
            else:
                if new_string.encode(enc) == string:
                    return new_string, 0, enc

        # If no encoding decodes cleanly, force-decode with each one, dropping
        # undecodable bytes, and keep the result that preserves the most characters.
        output = [(unicode(string, enc, "ignore"), enc) for enc in encodings]
        output = [(len(new_string[0]), new_string) for new_string in output]
        output.sort()
        new_string, enc = output[-1][1]
        return new_string, 1, enc
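
Applied to the string from the question, a cut-down version of the same idea might look like the sketch below. This is not the function above: the name try_decodings and its candidate list are my own assumptions, it keeps only the round-trip check and drops the byte-range filters, and it is written so that it runs on both Python 2 and Python 3.

```python
def try_decodings(raw, encodings=("ascii", "utf-8", "cp1252", "iso-8859-1")):
    """Return (text, encoding) for the first codec that decodes `raw`
    and encodes back to exactly the same bytes.

    Hypothetical helper for illustration; the candidate order is an assumption.
    """
    for enc in encodings:
        try:
            text = raw.decode(enc)
        except UnicodeError:
            continue  # this codec cannot represent the bytes, try the next one
        if text.encode(enc) == raw:
            return text, enc
    # Nothing decoded cleanly: drop the non-ASCII bytes as a last resort.
    return raw.decode("ascii", "ignore"), None


raw = b"Chocolate Moelleux-M\xe8re"
text, enc = try_decodings(raw)
```

Here ascii and utf-8 both reject the byte 0xe8, so the first codec that succeeds is cp1252. Bear in mind that a successful decode only shows the bytes are valid in that codec, not that it is the encoding the data was written in.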

In addition, this link gives good background on why the default encoding matters: Why we need sys.setdefaultencoding in py script
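
On the default-encoding point: the traceback in the question comes from unicode(s) and s.decode() falling back to the 'ascii' codec when no encoding is given. A minimal sketch of naming the codec explicitly instead, assuming the bytes are ISO-8859-1 (the question does not say which encoding the data uses):

```python
raw = b"Chocolate Moelleux-M\xe8re"

# Decoding as ASCII fails on byte 0xe8, matching the traceback in the question.
try:
    raw.decode("ascii")
    ascii_failed = False
    bad_position = None
except UnicodeDecodeError as exc:
    ascii_failed = True
    bad_position = exc.start  # index of the first undecodable byte

# Passing the codec explicitly avoids relying on the interpreter default.
# ASSUMPTION: the data is ISO-8859-1; substitute the real encoding if known.
text = raw.decode("iso-8859-1")
```

If the source encoding is known, this is simpler than guessing; the heuristic function is for input where it is not.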

answered Oct 24 '22 14:10 by Srikar Appalaraju