Safe decoding in Python ('?' symbol instead of exception)

Question

I have code:

encoding = guess_encoding()    
text = unicode(text, encoding)

When an invalid character appears in the text, a UnicodeDecodeError is raised. How can I silently skip the exception and replace the invalid character with '?'?

Sven Marnach · Accepted Answer

Try

text = unicode(text, encoding, "replace")

From the documentation:

'replace' causes the official Unicode replacement character, U+FFFD, to be used to replace input characters which cannot be decoded.

If you want to use "?" instead of the official Unicode replacement character, you can do

text = text.replace(u"\uFFFD", "?")

after converting to unicode.

Laszlo Treszkai · Answer

In Python 3, you can decode a bytes object into a string using its decode method. It accepts two parameters:

encoding, which is "utf-8" by default, and
errors, which defines what to do with byte sequences that cannot be decoded. The default value is "strict", which raises a UnicodeDecodeError; other options include "ignore", which drops the offending bytes, and "replace", which substitutes the Unicode replacement character "\uFFFD" for them.
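
To make the difference between the three handlers concrete, here is a small sketch. The sample bytes are made up for illustration: 0xE9 is "é" in Latin-1 but is not a valid sequence in UTF-8 when followed by a space.

```python
# Illustrative input: the byte 0xE9 cannot be decoded as UTF-8 here.
raw = b"caf\xe9 au lait"

# "strict" (the default) raises on the first undecodable byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "ignore" silently drops the undecodable byte.
ignored = raw.decode("utf-8", errors="ignore")

# "replace" substitutes U+FFFD for the undecodable byte.
replaced = raw.decode("utf-8", errors="replace")

print(strict_failed)
print(repr(ignored))
print(repr(replaced))
```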

Therefore, you'd need to do this to decode-and-replace:

encoding = guess_encoding()
text = text_bytes.decode(encoding, errors='replace').replace('\uFFFD', '?')
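
One caveat with the line above: the final replace also turns any U+FFFD that was legitimately present in the input into "?". If that matters, a possible alternative is to register a custom error handler with codecs.register_error, so that only undecodable bytes become "?". This is a sketch rather than part of the original answer; the handler name "qmark" and the function name are made up for illustration.

```python
import codecs


def question_mark_handler(exc):
    """Replace each undecodable byte range with a single '?'."""
    if isinstance(exc, UnicodeDecodeError):
        # Return the replacement text and the position at which to resume.
        return ("?", exc.end)
    # This sketch only deals with decoding errors.
    raise exc


# "qmark" is a hypothetical name chosen for this example.
codecs.register_error("qmark", question_mark_handler)

# The input holds one invalid byte (0xE9) and one validly encoded U+FFFD.
text_bytes = b"caf\xe9 \xef\xbf\xbd"
text = text_bytes.decode("utf-8", errors="qmark")
print(repr(text))
```

Here the invalid byte is turned into "?", while the replacement character that was already validly encoded in the input is left untouched.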

As Sven Marnach pointed out in a comment, when the text comes from a file you can supply the errors argument directly to open. Otherwise, a file opened in text mode raises the UnicodeDecodeError while it is being read, as soon as it encounters bytes that are not valid in the file's encoding.
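
A minimal sketch of passing errors to open follows. The temporary file and its contents exist only to make the example self-contained, and "utf-8" stands in for whatever guess_encoding() would return.

```python
import os
import tempfile

# Write bytes that are not valid UTF-8 to a temporary file for the demo.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"na\xefve text\n")
    path = tmp.name

try:
    # errors="replace" makes the text-mode reader substitute U+FFFD
    # instead of raising UnicodeDecodeError part-way through the read.
    with open(path, encoding="utf-8", errors="replace") as f:
        content = f.read().replace("\ufffd", "?")
finally:
    os.remove(path)

print(repr(content))
```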
