
Python unicode: how to replace character that cannot be decoded using utf8 with whitespace?

How to replace characters that cannot be decoded using utf8 with whitespace?

# -*- coding: utf-8 -*-
print unicode('\x97', errors='ignore') # prints an empty string
print unicode('ABC\x97abc', errors='ignore') # prints ABCabc

How can I print out ABC abc instead of ABCabc? Note that \x97 is just an example character; the characters that cannot be decoded come from unknown input.

  • With errors='ignore', the undecodable character is dropped, so nothing is printed in its place.
  • With errors='replace', the character is replaced with the special replacement character U+FFFD rather than whitespace.
DehengYe asked Aug 20 '15 10:08





2 Answers

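Take a look at codecs.register_error, which lets you register custom error handlers. A handler receives the exception and returns a tuple of the replacement text and the position at which decoding should resume.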

https://docs.python.org/2/library/codecs.html#codecs.register_error

import codecs

# The handler returns (replacement text, position at which to resume decoding).
codecs.register_error('replace_with_space', lambda e: (u' ', e.start + 1))
print unicode('ABC\x97abc', encoding='utf-8', errors='replace_with_space')  # ABC abc
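
For readers on Python 3, where unicode() no longer exists, the same idea might look roughly like the sketch below. This is an adaptation, not part of the original answer: it assumes the input is a bytes object, the handler name replace_with_space is only illustrative, and it resumes at error.end rather than error.start + 1 so that a multi-byte invalid run yields a single space.

```python
import codecs


def replace_with_space(error):
    """Illustrative handler: emit one space per undecodable run of bytes."""
    if isinstance(error, UnicodeDecodeError):
        # error.end is the index just past the bytes the decoder rejected.
        return (" ", error.end)
    # Encoding or translation errors are not handled by this sketch.
    raise error


codecs.register_error("replace_with_space", replace_with_space)

decoded = b"ABC\x97abc".decode("utf-8", errors="replace_with_space")
print(decoded)  # ABC abc
```

The registration is global to the process, so the handler name can then be passed as errors= anywhere a codec is used, including open().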
HelloWorld answered Sep 29 '22 22:09



You can use a try-except statement to handle the UnicodeDecodeError:

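def my_encoder(my_string):
    # Decode one byte at a time. unicode(i) uses the default encoding (ASCII
    # unless changed), so every non-ASCII byte is replaced, including bytes
    # that belong to valid UTF-8 sequences.
    for i in my_string:
        try:
            yield unicode(i)
        except UnicodeDecodeError:
            yield '\t'  # or another whitespace character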

Then use the str.join method to join the pieces back into a string:

print ''.join(my_encoder(my_string))

Demo:

>>> print ''.join(my_encoder('this is a\x97n exam\x97ple'))
this is a   n exam  ple
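
A different route, which keeps valid multi-byte UTF-8 characters intact, is to decode with the built-in errors='replace' handler and then swap the replacement character U+FFFD for a space. The sketch below is written for Python 3 and is only an illustration: decode_with_spaces is a hypothetical helper name, and it assumes the input never contains a genuine U+FFFD, since that would be turned into a space as well.

```python
def decode_with_spaces(raw: bytes) -> str:
    """Decode UTF-8, turning each undecodable sequence into one space."""
    # errors="replace" inserts U+FFFD wherever the decoder rejects input;
    # the str.replace call then maps that marker to a plain space.
    return raw.decode("utf-8", errors="replace").replace("\ufffd", " ")


result = decode_with_spaces(b"this is a\x97n exam\x97ple")
print(result)  # this is a n exam ple
```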
Mazdak answered Sep 30 '22 00:09
