How to replace invalid unicode characters in a string in Python?

Tags: python, string, character-encoding, unicode

As far as I know, Python's design is that a string contains only valid characters, but in my case the OS delivers path names with invalid encodings that I have to deal with. So I end up with strings that contain characters that are not valid Unicode.

To correct these problems I need to display these strings somehow. Unfortunately I cannot print them because they contain non-Unicode characters. Is there an elegant way to replace these characters so that I at least get some idea of the string's content?

My idea would be to process these strings character by character and check whether each character is actually valid Unicode, substituting a chosen Unicode symbol for any invalid one. But how can I do this? Using codecs does not seem suitable for that purpose: I already have a string, returned by the operating system, and not a byte array. Converting a string to a byte array seems to involve encoding, which will of course fail in my case. So it seems that I'm stuck.

Do you have any tips on how I could create such a replacement string?

asked Jul 25 '16 09:07

Regis May

1 Answer

If you have a bytestring (undecoded data), use the 'replace' error handler. For example, if your data is (mostly) UTF-8 encoded, then you could use:

decoded_unicode = bytestring.decode('utf-8', 'replace')

and a U+FFFD � REPLACEMENT CHARACTER will be inserted for any bytes that can't be decoded.

If you wanted to use a different replacement character, it is easy enough to replace these afterwards:

decoded_unicode = decoded_unicode.replace('\ufffd', '#')
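
One caveat with replacing afterwards: if the data could legitimately contain U+FFFD already, those characters would be rewritten too. A possible way around that, as a minimal sketch, is to register a custom error handler with codecs.register_error. The handler name 'hashreplace' and the '#' placeholder below are made up for illustration:

```python
import codecs

def hash_replace(exc):
    """Hypothetical handler: substitute '#' for undecodable input."""
    if isinstance(exc, UnicodeDecodeError):
        # One '#' per byte in the range the decoder reported as invalid,
        # then resume decoding right after that range.
        return ('#' * (exc.end - exc.start), exc.end)
    raise exc  # only decoding errors are handled in this sketch

# 'hashreplace' is an illustrative name, not a built-in handler.
codecs.register_error('hashreplace', hash_replace)

decoded_unicode = b'F\xc3\xb8\xbbB\xc3\xa5r'.decode('utf-8', 'hashreplace')
```

Because the placeholder is inserted at decode time, any U+FFFD that was validly encoded in the input is left alone.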

Demo:

>>> bytestring = b'F\xc3\xb8\xc3\xb6\xbbB\xc3\xa5r'
>>> bytestring.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xbb in position 5: invalid start byte
>>> bytestring.decode('utf8', 'replace')
'Føö�Bår'
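
If what you have is already a str rather than bytes, which is the situation the question describes, and this is Python 3 on a POSIX system, undecodable bytes in file names normally arrive as lone surrogates, because the os functions decode names with the 'surrogateescape' error handler. A rough sketch, assuming the names are meant to be UTF-8, is to turn the string back into its original bytes and decode again with 'replace'. The function name printable_name is hypothetical:

```python
def printable_name(name, encoding='utf-8'):
    """Return a copy of name that is safe to print (illustrative helper)."""
    # 'surrogateescape' maps each lone surrogate U+DC80..U+DCFF back to
    # the original undecodable byte 0x80..0xFF.
    raw = name.encode(encoding, 'surrogateescape')
    # Decoding again with 'replace' yields U+FFFD for those bytes.
    return raw.decode(encoding, 'replace')

# Simulate a file name the way the OS layer would hand it to Python 3.
name = b'F\xc3\xb8\xc3\xb6\xbbB\xc3\xa5r'.decode('utf-8', 'surrogateescape')
safe = printable_name(name)
```

os.fsencode(name) performs a similar encode step using the file system's own encoding, which may be a better fit than hard-coding UTF-8.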

answered Sep 16 '22 12:09

Martijn Pieters
