Safe decoding in Python ('?' symbol instead of exception)

Tags: python

I have this code:

encoding = guess_encoding()    
text = unicode(text, encoding)

When the text contains a byte sequence that is invalid in the guessed encoding, a UnicodeDecodeError is raised. How can I suppress the exception and have each undecodable character replaced with '?' instead?

kilonet asked Dec 07 '22 22:12


2 Answers

Pass "replace" as the third (errors) argument:

text = unicode(text, encoding, "replace")

From the documentation:

'replace' causes the official Unicode replacement character, U+FFFD, to be used to replace input characters which cannot be decoded.

If you want to use "?" instead of the official Unicode replacement character, you can do

text = text.replace(u"\uFFFD", "?")

after converting to unicode.

Sven Marnach answered Dec 21 '22 23:12


In Python 3, you can decode a bytes object into a string using the decode method. It accepts two parameters:

  • encoding, which is "utf-8" by default, and
  • errors, which defines what to do with byte sequences that cannot be decoded. The default value is "strict", which raises a UnicodeDecodeError; other options include "ignore" and "replace". The latter substitutes the Unicode replacement character "\uFFFD" for each undecodable sequence.
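
To make the difference between these error modes concrete, here is a small sketch. The sample bytes are made up for illustration: 0xFF can never appear in valid UTF-8, so it triggers the error handling in every mode.

```python
# Illustrative input: "abc", one invalid byte (0xFF), then "def".
raw = b"abc\xffdef"

try:
    raw.decode("utf-8")  # errors="strict" is the default
    strict_result = None
except UnicodeDecodeError as exc:
    strict_result = type(exc).__name__

ignored = raw.decode("utf-8", errors="ignore")    # silently drops the bad byte
replaced = raw.decode("utf-8", errors="replace")  # inserts U+FFFD in its place

print(strict_result)   # UnicodeDecodeError
print(repr(ignored))   # 'abcdef'
print(repr(replaced))  # 'abc\ufffddef'
```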

Therefore, you'd need to do this to decode-and-replace:

encoding = guess_encoding()
text = text_bytes.decode(encoding, errors='replace').replace('\uFFFD', '?')
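
One caveat of the decode-then-replace approach is that it also turns any genuine U+FFFD already present in the input into '?'. If that matters, a possible alternative is to register a custom error handler with codecs.register_error. The sketch below is only an illustration: the handler name "qmark" and the helper safe_decode are hypothetical names chosen here, not part of the standard library.

```python
import codecs


def question_mark_handler(exc):
    """Substitute '?' for each undecodable byte run reported by the codec."""
    if isinstance(exc, UnicodeDecodeError):
        # Return the replacement text and the position at which to resume.
        return ("?", exc.end)
    raise exc


# "qmark" is a name chosen for this sketch, not a built-in error handler.
codecs.register_error("qmark", question_mark_handler)


def safe_decode(data, encoding):
    """Decode bytes, replacing undecodable input with '?' (hypothetical helper)."""
    return data.decode(encoding, errors="qmark")


print(safe_decode(b"abc\xffdef", "utf-8"))  # abc?def
```

With this handler, a U+FFFD that was legitimately encoded in the input survives decoding unchanged, because only bytes the codec rejects are replaced.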

As Sven Marnach pointed out in a comment, you can also pass the errors argument directly to open when reading a file in text mode; otherwise a UnicodeDecodeError is raised while reading if the file contains bytes that are invalid in the chosen encoding.
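
A minimal sketch of the open variant follows. It writes a temporary file with one invalid byte purely so the example is self-contained; the file contents and the choice of "utf-8" are assumptions for illustration.

```python
import os
import tempfile

# Create a throwaway file containing a byte that is not valid UTF-8.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc\xffdef\n")
    path = tmp.name

try:
    # errors="replace" applies to every read from this text-mode file object.
    with open(path, encoding="utf-8", errors="replace") as f:
        text = f.read().replace("\ufffd", "?")
finally:
    os.remove(path)

print(repr(text))  # 'abc?def\n'
```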

Laszlo Treszkai answered Dec 21 '22 23:12