How to replace characters that cannot be decoded using utf8 with whitespace?
# -*- coding: utf-8 -*-
print unicode('\x97', encoding='utf-8', errors='ignore')  # prints nothing
print unicode('ABC\x97abc', encoding='utf-8', errors='ignore')  # prints ABCabc
How can I print out ABC abc instead of ABCabc? Note that \x97 is just an example character. The characters that cannot be decoded are unknown inputs.
With errors='ignore', the undecodable character is dropped and nothing is printed in its place. With errors='replace', it is replaced with the Unicode replacement character (U+FFFD) rather than with whitespace.
Unicode literals in Python source code: specific code points can be written using the \u escape sequence, which is followed by four hex digits giving the code point. The \U escape sequence is similar, but expects eight hex digits, not four.
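As a rough illustration of the behaviour described above, the following sketch uses Python 3 syntax (an assumption; the snippets in the question are Python 2) to show what the built-in ignore and replace handlers return for the example bytes. It also uses the \u and \U escapes to write the replacement character U+FFFD and a space. All variable names are made up for illustration.

```python
# Python 3 sketch (assumption: the question itself targets Python 2).
raw = b"ABC\x97abc"  # \x97 is not valid UTF-8 on its own

ignored = raw.decode("utf-8", errors="ignore")    # bad byte is dropped
replaced = raw.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD

# Post-process: swap the replacement character for a plain space.
spaced = replaced.replace("\ufffd", " ")

# \u takes four hex digits, \U takes eight; both spell the same code point.
same_char = "\u0020" == "\U00000020" == " "

print(repr(ignored))   # 'ABCabc'
print(repr(spaced))    # 'ABC abc'
print(same_char)       # True
```

Swapping U+FFFD for a space after decoding is a simple workaround, but it would also rewrite any genuine U+FFFD already present in the input.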
The key to troubleshooting Unicode errors in Python is to know what types you have. Then, if some variables are byte sequences instead of Unicode objects, convert them to Unicode objects with decode() or a u'' literal before handling them.
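The advice above about knowing your types could be sketched as follows. This is a minimal example in Python 3 syntax (an assumption; in Python 2 the corresponding types are str and unicode), and to_text is a hypothetical helper name, not something from the original answers.

```python
# Python 3 sketch: normalise input to text before handling it.
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a byte sequence."""
    if isinstance(value, bytes):
        return value.decode(encoding)  # raises UnicodeDecodeError on bad input
    return value


print(type(to_text(b"abc")).__name__)  # str
print(type(to_text("abc")).__name__)   # str
```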
Take a look at codecs.register_error. You can use it to register custom error handlers:
https://docs.python.org/2/library/codecs.html#codecs.register_error
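import codecs

# The handler returns (replacement, position to resume decoding at):
# one space is substituted for each undecodable byte.
codecs.register_error('replace_with_space', lambda e: (u' ', e.start + 1))
print unicode('ABC\x97abc', encoding='utf-8', errors='replace_with_space')  # prints ABC abc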
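For readers on Python 3, the same idea might look like the sketch below. It assumes Python 3 syntax, where bytes objects are decoded with .decode(), and it writes the handler as a named function (replace_with_space, chosen here for illustration) so that it can refuse errors it is not meant to handle.

```python
# Python 3 sketch of the custom error handler approach.
import codecs


def replace_with_space(error):
    """Substitute one space for each undecodable byte."""
    if not isinstance(error, UnicodeDecodeError):
        raise error  # only decoding errors are handled in this sketch
    # Return the replacement text and the position to resume decoding at.
    return (" ", error.start + 1)


codecs.register_error("replace_with_space", replace_with_space)

decoded = b"ABC\x97abc".decode("utf-8", errors="replace_with_space")
print(decoded)  # ABC abc
```

Because the real UTF-8 decoder does the work, valid multi-byte sequences are kept intact and only the bytes that fail to decode are turned into spaces.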
You can use a try-except statement to handle the UnicodeDecodeError:
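def my_encoder(my_string):
    # Decodes one byte at a time with the default (ASCII) codec, so every
    # non-ASCII byte, including bytes of valid UTF-8 sequences, becomes whitespace.
    for i in my_string:
        try:
            yield unicode(i)
        except UnicodeDecodeError:
            yield '\t'  # or another whitespace character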
Then use the str.join method to join the pieces back into one string:
print ''.join(my_encoder(my_string))
Demo:
>>> print ''.join(my_encoder('this is a\x97n exam\x97ple'))
this is a n exam ple
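A variant of the try-except idea that keeps valid multi-byte UTF-8 characters could look like the sketch below. It assumes Python 3 syntax, and decode_with_spaces is a hypothetical name. Instead of decoding byte by byte, it decodes as much as possible and uses the start offset reported by UnicodeDecodeError to skip one bad byte at a time.

```python
# Python 3 sketch: try/except decoding that preserves valid UTF-8 sequences.
def decode_with_spaces(data, encoding="utf-8"):
    """Decode bytes, substituting a space for each undecodable byte."""
    pieces = []
    pos = 0
    while pos < len(data):
        try:
            # Try to decode everything that is left in one go.
            pieces.append(data[pos:].decode(encoding))
            break
        except UnicodeDecodeError as exc:
            # exc.start is relative to the slice; bytes before it are valid.
            pieces.append(data[pos:pos + exc.start].decode(encoding))
            pieces.append(" ")
            pos += exc.start + 1  # skip the offending byte and retry
    return "".join(pieces)


print(decode_with_spaces(b"this is a\x97n exam\x97ple"))  # this is a n exam ple
```

Each failure re-slices the remaining input, so this can be slow on long inputs with many bad bytes; a registered error handler avoids that cost.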