Python unicode: how to replace character that cannot be decoded using utf8 with whitespace?

Tags: python, unicode, utf-8

How to replace characters that cannot be decoded using utf8 with whitespace?

# -*- coding: utf-8 -*-
print unicode('\x97', errors='ignore') # prints nothing
print unicode('ABC\x97abc', errors='ignore') # prints ABCabc

How can I print ABC abc instead of ABCabc? Note that \x97 is just an example character; the characters that cannot be decoded come from unknown input.

With errors='ignore', the undecodable character is dropped, so nothing is printed in its place.
With errors='replace', the character is replaced with U+FFFD (the Unicode replacement character) rather than with a space.

asked Aug 20 '15 10:08

DehengYe

2 Answers

Take a look at codecs.register_error. You can use it to register custom error handlers:

https://docs.python.org/2/library/codecs.html#codecs.register_error

import codecs
# The handler returns (replacement text, position at which decoding resumes).
codecs.register_error('replace_with_space', lambda e: (u' ', e.start + 1))
print unicode('ABC\x97abc', encoding='utf-8', errors='replace_with_space')
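
On Python 3, where unicode() no longer exists, the same idea might look like the sketch below. It assumes the input is a bytes object, and it resumes at error.end instead of start + 1 so that all the bytes the codec flagged are skipped together. The function name replace_with_space is only illustrative.

```python
import codecs


def replace_with_space(error):
    """Error handler: substitute a single space for the undecodable bytes."""
    if not isinstance(error, UnicodeDecodeError):
        raise error  # this sketch only deals with decoding errors
    # error.end is the index just past the offending bytes; resume there.
    return (" ", error.end)


codecs.register_error("replace_with_space", replace_with_space)

decoded = b"ABC\x97abc".decode("utf-8", errors="replace_with_space")
print(decoded)  # ABC abc
```

Because the handler is registered by name, any later decode call in the same process can pass errors="replace_with_space".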

answered Sep 29 '22 22:09

HelloWorld

You can use a try-except statement to handle the UnicodeDecodeError:

def my_encoder(my_string):
    for i in my_string:
        try:
            # Decodes one byte at a time with the default (ASCII) codec.
            yield unicode(i)
        except UnicodeDecodeError:
            yield '\t'  # or any other whitespace character

Then use the str.join method to join the pieces back into one string:

print ''.join(my_encoder(my_string))

Demo:

>>> print ''.join(my_encoder('this is a\x97n exam\x97ple'))
this is a   n exam  ple
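
One caveat: the generator decodes each byte separately with the default ASCII codec, so every byte of a valid multi-byte UTF-8 character is also turned into whitespace. A possible alternative, sketched here for Python 3 and assuming bytes input, is to decode with errors='replace' and then swap the U+FFFD markers for spaces. The helper name decode_with_spaces is hypothetical.

```python
def decode_with_spaces(raw, encoding="utf-8"):
    """Decode bytes, turning each U+FFFD replacement marker into a space."""
    text = raw.decode(encoding, errors="replace")
    # Assumption: the input holds no genuine U+FFFD characters worth keeping,
    # since those would be converted to spaces as well.
    return text.replace("\ufffd", " ")


print(decode_with_spaces(b"this is a\x97n exam\x97ple"))  # this is a n exam ple
print(decode_with_spaces(b"caf\xc3\xa9\x97!"))            # café !
```

Valid multi-byte sequences such as the é above survive intact, which the per-byte generator cannot guarantee.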

answered Sep 30 '22 00:09

Mazdak
