Python 3: os.walk() file paths UnicodeEncodeError: 'utf-8' codec can't encode: surrogates not allowed

This code:

import os

for root, dirs, files in os.walk('.'):
    print(root)

Gives me this error:

UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 27: surrogates not allowed

How do I walk through a file tree without getting toxic strings like this?

823

asked Dec 08 '14 20:12

Collin Anderson

2 Answers

On Linux, filenames are 'just a bunch of bytes' and are not necessarily encoded in any particular encoding. Python 3 tries to turn everything into Unicode strings, so the developers came up with a scheme (the 'surrogateescape' error handler, PEP 383) to translate byte strings to Unicode strings and back without loss and without knowing the original encoding. Each undecodable byte is mapped to a lone surrogate code point in the range U+DC80 to U+DCFF, but the normal UTF-8 encoder rejects lone surrogates, which is why printing such a string to the terminal fails.

For example, here's a non-UTF-8 byte string:

>>> b'C\xc3N'.decode('utf8','surrogateescape')
'C\udcc3N'

It can be converted to and from Unicode without loss:

>>> b'C\xc3N'.decode('utf8','surrogateescape').encode('utf8','surrogateescape')
b'C\xc3N'

But it can't be printed:

>>> print(b'C\xc3N'.decode('utf8','surrogateescape'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 1: surrogates not allowed
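To spot such names before printing them, one option is to check whether a string survives a strict UTF-8 encode. This is only a sketch: the helper name `encodes_cleanly` is hypothetical and not part of the standard library, and it assumes UTF-8 is the encoding you care about.

```python
def encodes_cleanly(name: str) -> bool:
    """Return True if name can be encoded as strict UTF-8."""
    try:
        name.encode('utf-8')  # the default 'strict' handler rejects lone surrogates
    except UnicodeEncodeError:
        return False
    return True


escaped = b'C\xc3N'.decode('utf-8', 'surrogateescape')

# The undecodable byte 0xC3 is stored as the lone surrogate U+DC00 + 0xC3.
assert ord(escaped[1]) == 0xDC00 + 0xC3

print(encodes_cleanly(escaped))  # False
print(encodes_cleanly('CN'))     # True
```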

You'll have to decide what to do with file names that are not in the default encoding. One option is to encode them back to the original bytes and decode those with the 'replace' error handler, which substitutes U+FFFD for anything undecodable. Use the result for display only, and keep the original name to access the file.

>>> b'C\xc3N'.decode('utf8','replace')
'C�N'
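The same idea can be applied to a string that os.walk() has already decoded with surrogate escapes. A minimal sketch, assuming UTF-8 as the codec; `display_name` is a hypothetical helper name, and real code may prefer os.fsencode() to recover the bytes using the filesystem encoding:

```python
def display_name(name: str) -> str:
    """Return a printable version of a name that may contain surrogate escapes."""
    raw = name.encode('utf-8', 'surrogateescape')  # recover the original bytes
    return raw.decode('utf-8', 'replace')          # undecodable bytes become U+FFFD


original = b'C\xc3N'.decode('utf-8', 'surrogateescape')
shown = display_name(original)
print(ascii(shown))  # 'C\ufffdN'
```

Only `shown` is meant for printing; `original` is still the value to pass to open() or os.stat().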

os.walk can also take a byte string and will return byte strings instead of Unicode strings:

for p, d, f in os.walk(b'.'):
    print(p)  # p is a bytes path; d and f are lists of bytes names

Then you can decode as you like.
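As a rough illustration of "decode as you like", the sketch below walks a tree in bytes mode and pairs each raw path with a display-only string. The generator name `walk_files` is hypothetical, and decoding with UTF-8 plus 'replace' is an assumption; pick whatever codec and error handler suit your data.

```python
import os


def walk_files(top=b'.'):
    """Yield (raw_path, display_path) for every file under top.

    raw_path is the untouched bytes path, safe to pass to open() or os.stat().
    display_path is a lossy str intended only for printing or logging.
    """
    top = os.fsencode(top)  # accept str or bytes, always walk in bytes mode
    for root, dirs, files in os.walk(top):
        for name in files:
            raw = os.path.join(root, name)
            yield raw, raw.decode('utf-8', 'replace')
```

Iterating over `walk_files(b'.')` then gives pairs where the first element is used for file access and the second for output.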

109

answered Sep 22 '22 08:09

Mark Tolonen

I ended up passing a byte string to os.walk(), which apparently returns byte strings instead of Unicode strings that can't be encoded:

for root, dirs, files in os.walk(b'.'):
    print(root)

answered Sep 23 '22 08:09

Collin Anderson

Tags: python, python-3.x, unicode, python-unicode, unicode-string
