How to decode a non-Unicode character in Python?

Tags:

unicode

I have a string, say s = 'Chocolate Moelleux-M\xe8re'. When I do:

In [14]: unicode(s)
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 20: ordinal not in range(128)

Similarly, when I try to decode it with s.decode(), I get the same error:

In [13]: s.decode()
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 20: ordinal not in range(128)

How can I decode such a string into Unicode?

275

asked Oct 06 '10 06:10

1 Answer

I have had to face this problem one too many times. In my case the strings came in several different encodings, so I wrote a function that decodes a string heuristically, based on characteristic features of each encoding.

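import re
import sys

def decode_heuristically(string, enc=None, denc=sys.getdefaultencoding()):
    """
    Try to interpret 'string' using several possible encodings (Python 2).
    @input : string (a byte str), enc (an encoding to try first, optional),
             denc (default encoding, tried early unless it is ascii).
    @output: a tuple (decoded_string, flag_decoded, encoding);
             flag_decoded is 1 if undecodable bytes had to be dropped, else 0.
    """
    # Already unicode: nothing to decode.
    if isinstance(string, unicode): return string, 0, "utf-8"
    try:
        # Pure ASCII input needs no guessing; return the decoded object.
        new_string = unicode(string, "ascii")
        return new_string, 0, "ascii"
    except UnicodeError:
        encodings = ["utf-8", "iso-8859-1", "cp1252", "iso-8859-15"]

        if denc != "ascii": encodings.insert(0, denc)

        # A caller-supplied encoding takes priority over everything else.
        if enc: encodings.insert(0, enc)

        for candidate in encodings:
            # Bytes 0x80-0x9f are C1 control codes in ISO-8859-1/-15 and
            # rarely occur in real text, so skip those two encodings.
            if (candidate in ("iso-8859-15", "iso-8859-1") and
                re.search(r"[\x80-\x9f]", string) is not None):
                continue

            # These bytes are the positions where ISO-8859-15 differs from
            # ISO-8859-1; if present, skip ISO-8859-1 and cp1252.
            if (candidate in ("iso-8859-1", "cp1252") and
                re.search(r"[\xa4\xa6\xa8\xb4\xb8\xbc-\xbe]", string)
                is not None):
                continue

            try:
                new_string = unicode(string, candidate)
            except UnicodeError:
                pass
            else:
                # Accept only if encoding back reproduces the input bytes.
                if new_string.encode(candidate) == string:
                    return new_string, 0, candidate

        # Nothing decoded cleanly: force a decode with each encoding,
        # dropping undecodable bytes, and keep the longest result.
        output = [(unicode(string, e, "ignore"), e) for e in encodings]
        output = [(len(pair[0]), pair) for pair in output]
        output.sort()
        new_string, enc = output[-1][1]
        return new_string, 1, enc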
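
The function above targets Python 2, where byte strings are str. For readers on Python 3, the same try-in-order idea can be sketched much more briefly. This is only an illustration, not a port of the heuristics: the name decode_with_fallbacks and the default candidate list are assumptions made for the example, and the regex-based filtering is left out.

```python
def decode_with_fallbacks(data, encodings=("utf-8", "cp1252", "iso-8859-1")):
    """Return (text, encoding, lossy) for a bytes object.

    Hypothetical helper: tries each encoding in order and returns the
    first strict decode that succeeds.
    """
    for name in encodings:
        try:
            return data.decode(name), name, False
        except UnicodeDecodeError:
            continue
    # No candidate decoded cleanly: fall back to the first encoding and
    # substitute U+FFFD for every undecodable byte.
    first = encodings[0]
    return data.decode(first, errors="replace"), first, True


# The string from the question, as a Python 3 bytes literal.
text, used, lossy = decode_with_fallbacks(b"Chocolate Moelleux-M\xe8re")
print(text, used, lossy)
```

Order matters here: UTF-8 goes first because it rejects most non-UTF-8 input, while iso-8859-1 accepts any byte sequence and so only makes sense as the last candidate.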

To add to this, the following link gives good background on why encodings matter: Why we need sys.setdefaultencoding in py script
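
On the default-encoding point: the errors in the question arise because unicode(s) and s.decode() fall back to the interpreter's default codec, which is 'ascii' in Python 2. Naming the codec explicitly avoids depending on that default. A minimal sketch in Python 3 syntax, assuming the bytes are Latin-1 (the byte 0xe8 is è in both ISO-8859-1 and cp1252); the variable names are illustrative only:

```python
raw = b"Chocolate Moelleux-M\xe8re"

# Decoding as ASCII fails at the first byte outside the range 0-127.
try:
    raw.decode("ascii")
    bad_position = None
except UnicodeDecodeError as exc:
    bad_position = exc.start  # index of the offending byte

# Naming the (assumed) source encoding makes the decode succeed.
decoded = raw.decode("latin-1")
print(bad_position, decoded)
```

If the real source encoding is different, the decode may still succeed but produce the wrong characters, so the encoding should come from knowledge of where the data originated whenever that is available.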

147

answered Oct 24 '22 14:10

Srikar Appalaraju
