So I have a message which is read from a file of unknown encoding, and I want to send it to a webpage for display. I've grappled a lot with UnicodeErrors and have gone through many Q&As on StackOverflow, and I think I have a decent understanding of how Unicode and encoding work. My current code looks like this:
try:
    return message.decode(encoding='utf-8')
except:
    try:
        return message.decode(encoding='latin-1')
    except:
        print("Unable to entirely decode in latin-1 or utf-8, will replace error characters with '?'")
        return message.decode(encoding='utf-8', errors="replace")
The returned message is then dumped into JSON and sent to the front end.
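(For context, the decoded result is serialized roughly like this; the function and variable names here are just illustrative, not my actual code:)

import json
payload = json.dumps({'message': decode_message(raw_bytes)})   # decode_message wraps the try/except above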
I assumed that because I'm using errors="replace" on the last try/except, I would avoid exceptions at the expense of having a few '?' characters in my display. An acceptable cost.
However, it seems I was too hopeful: for some files I still get a UnicodeDecodeError saying the "ascii" codec cannot decode some character. Why doesn't errors="replace" just take care of this?
(Also, as a bonus question: what does ASCII have to do with any of this? I'm specifying UTF-8.)
Not every Unicode character can be represented in every byte encoding. Any character that has no mapping in the target encoding will cause encoding to fail and raise UnicodeEncodeError. To avoid this error, be explicit about the codec and call encode('utf-8') and decode('utf-8') accordingly in your code.
decode() is a method on byte strings (str in Python 2). It converts a byte string from the encoding it was written in back into a Unicode string, i.e. it works in the opposite direction to encode(). It accepts the encoding of the byte string and returns the corresponding Unicode string.
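For example, a minimal round trip between Unicode text and UTF-8 bytes (variable names are just illustrative):

text = u'caf\xe9'               # a Unicode string containing 'é'
data = text.encode('utf-8')     # Unicode -> bytes: 'caf\xc3\xa9'
back = data.decode('utf-8')     # bytes -> Unicode, restoring the original text
assert back == text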
The sole (intended) purpose of surrogateescape is to work around the "feature" of Unix where filenames can be any arbitrary string of bytes, but at the same time they are usually UTF-8 (or, on older systems, some ASCII-superset 8-bit encoding like ISO-8859-1).
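A short Python 3 sketch of that behaviour (surrogateescape is a Python 3 error handler): an undecodable byte is smuggled through as a lone surrogate code point and can be written back out as the exact original byte.

raw = b'caf\xe9'                                              # not valid UTF-8
name = raw.decode('utf-8', errors='surrogateescape')          # the bad byte becomes u'\udce9'
assert name.encode('utf-8', errors='surrogateescape') == raw  # original bytes restored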
You should not get a UnicodeDecodeError with errors='replace'. Also, str.decode('latin-1') should never fail, because ISO-8859-1 has a valid character mapping for every possible byte sequence.
My suspicion is that message is already a unicode string, not bytes. Unicode text has already been 'decoded' from bytes and can't be decoded any more.
When you call .decode() on a unicode string, Python 2 tries to be helpful and decides to encode the Unicode string back to bytes (using the default encoding), so that you have something you can really decode. This implicit encoding step doesn't use errors='replace', so if there are any characters in the Unicode string that aren't in the default encoding (probably ASCII) you'll get a UnicodeEncodeError.
(Python 3 no longer does this, as it is terribly confusing.)
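A minimal sketch of that failure mode under Python 2 (assuming the default encoding is ASCII; note that decode() in Python 2 only accepts positional arguments):

message = u'caf\xe9'                 # already Unicode, contains a non-ASCII character
message.decode('utf-8', 'replace')   # implicit ASCII encode happens first and raises:
                                     # UnicodeEncodeError: 'ascii' codec can't encode character ...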
Check the type of message, and assuming it is indeed unicode, work back from there to find where it was decoded (possibly implicitly) and replace that with the correct decoding.
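Something along these lines (a sketch; bytes is an alias of str in Python 2):

if isinstance(message, bytes):
    # Only raw bytes need decoding; Unicode text is left alone.
    message = message.decode('utf-8', 'replace')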
Decoding with errors="replace" uses the 'replace' error handler (for text encodings only): it substitutes '?' for encoding errors (to be encoded by the codec) and '\ufffd' (the Unicode replacement character) for decoding errors. A "text encoding" here means a codec which encodes Unicode strings to bytes.
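For instance, decoding bytes that aren't valid UTF-8 with errors='replace' yields the replacement character instead of raising:

b'caf\xe9'.decode('utf-8', 'replace')    # -> u'caf\ufffd'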
Maybe your data is malformed; you could also try the 'ignore' error handler, where malformed data is simply dropped and encoding or decoding continues without further notice:
message.decode(encoding='utf-8', errors="ignore")