So I have a message which is read from a file of unknown encoding, and I want to send it to a webpage for display. I've grappled a lot with UnicodeErrors and have gone through many Q&As on StackOverflow, and I think I have a decent understanding of how Unicode and encoding work. My current code looks like this:
try:
    return message.decode(encoding='utf-8')
except:
    try:
        return message.decode(encoding='latin-1')
    except:
        print("Unable to entirely decode in latin-1 or utf-8, will replace error characters with '?'")
        return message.decode(encoding='utf-8', errors="replace")
The returned message is then dumped into JSON and sent to the front end.
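(For context, the decoded result is serialized roughly like this; the function and variable names here are just illustrative, not my actual code:)

import json
payload = json.dumps({'message': decode_message(raw_bytes)})   # decode_message wraps the try/except above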
I assumed that because I'm using errors="replace" on the last try/except, I would avoid exceptions at the expense of having a few '?' characters in my display. An acceptable cost.
However, it seems I was too hopeful: for some files I still get a UnicodeDecodeError saying the "ascii" codec cannot decode some character. Why doesn't errors="replace" just take care of this?
(Also, as a bonus question: what does ASCII have to do with any of this? I'm specifying UTF-8.)
Not every Unicode character can be represented in every byte encoding. Any character that has no mapping in the target encoding will cause encoding to fail and raise UnicodeEncodeError. To avoid this error, be explicit about the codec and call encode('utf-8') and decode('utf-8') accordingly in your code.
decode() is a method on byte strings (str in Python 2). It converts a byte string from the encoding it was written in back into a Unicode string, i.e. it works in the opposite direction to encode(). It accepts the encoding of the byte string and returns the corresponding Unicode string.
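For example, a minimal round trip between Unicode text and UTF-8 bytes (variable names are just illustrative):

text = u'caf\xe9'               # a Unicode string containing 'é'
data = text.encode('utf-8')     # Unicode -> bytes: 'caf\xc3\xa9'
back = data.decode('utf-8')     # bytes -> Unicode, restoring the original text
assert back == text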
The sole (intended) purpose of surrogateescape is to work around the "feature" of Unix where filenames can be any arbitrary string of bytes, but at the same time they are usually UTF-8 (or, on older systems, some ASCII-superset 8-bit encoding like ISO-8859-1).
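A short Python 3 sketch of that behaviour (surrogateescape is a Python 3 error handler): an undecodable byte is smuggled through as a lone surrogate code point and can be written back out as the exact original byte.

raw = b'caf\xe9'                                              # not valid UTF-8
name = raw.decode('utf-8', errors='surrogateescape')          # the bad byte becomes u'\udce9'
assert name.encode('utf-8', errors='surrogateescape') == raw  # original bytes restored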
You should not get a UnicodeDecodeError with errors='replace'. Also, str.decode('latin-1') should never fail, because ISO-8859-1 has a valid character mapping for every possible byte sequence.
My suspicion is that message is already a unicode string, not bytes. Unicode text has already been 'decoded' from bytes and can't be decoded any more.
When you call .decode() on a unicode string, Python 2 tries to be helpful and decides to encode the Unicode string back to bytes (using the default encoding), so that you have something you can really decode. This implicit encoding step doesn't use errors='replace', so if there are any characters in the Unicode string that aren't in the default encoding (probably ASCII) you'll get a UnicodeEncodeError.
(Python 3 no longer does this, as it is terribly confusing.)
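A minimal sketch of that failure mode under Python 2 (assuming the default encoding is ASCII; note that decode() in Python 2 only accepts positional arguments):

message = u'caf\xe9'                 # already Unicode, contains a non-ASCII character
message.decode('utf-8', 'replace')   # implicit ASCII encode happens first and raises:
                                     # UnicodeEncodeError: 'ascii' codec can't encode character ...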
Check the type of message, and assuming it is indeed unicode, work back from there to find where it was decoded (possibly implicitly) and replace that with the correct decoding.
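Something along these lines (a sketch; bytes is an alias of str in Python 2):

if isinstance(message, bytes):
    # Only raw bytes need decoding; Unicode text is left alone.
    message = message.decode('utf-8', 'replace')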
Decoding with errors="replace" uses the 'replace' error handler (for text encodings only): it substitutes '?' for encoding errors (to be encoded by the codec) and '\ufffd' (the Unicode replacement character) for decoding errors. A "text encoding" here means a codec which encodes Unicode strings to bytes.
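For instance, decoding bytes that aren't valid UTF-8 with errors='replace' yields the replacement character instead of raising:

b'caf\xe9'.decode('utf-8', 'replace')    # -> u'caf\ufffd'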
Maybe your data is malformed; you could also try the 'ignore' error handler, where malformed data is simply dropped and encoding or decoding continues without further notice:
message.decode(encoding='utf-8', errors="ignore")