
Python 3: os.walk() file paths UnicodeEncodeError: 'utf-8' codec can't encode: surrogates not allowed

This code:

    import os

    for root, dirs, files in os.walk('.'):
        print(root)

Gives me this error:

UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 27: surrogates not allowed 

How do I walk through a file tree without getting toxic strings like this?

asked Dec 08 '14 at 20:12 by Collin Anderson



2 Answers

On Linux, filenames are just a sequence of bytes and are not necessarily encoded in any particular encoding. Python 3 tries to turn everything into Unicode strings, so its developers came up with a scheme (the surrogateescape error handler, described in PEP 383) to translate byte strings to Unicode strings and back without loss, and without knowing the original encoding. Each undecodable byte is represented by a lone surrogate code point, but the normal UTF-8 encoder refuses to encode lone surrogates, which is why printing such a name to the terminal fails.
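As a rough illustration of that scheme (a sketch of the standard `surrogateescape` behaviour, not code from the answer): each undecodable byte in the range 0x80-0xFF is stored as the lone low surrogate U+DC00 plus the byte value, which is why the byte 0xC3 appears as '\udcc3' in the question's error message. The helper name `escape_byte` is made up for this example.

```python
def escape_byte(byte_value):
    """Return the character surrogateescape uses for one undecodable byte.

    Hypothetical helper for illustration; valid for byte values 0x80-0xFF.
    """
    if not 0x80 <= byte_value <= 0xFF:
        raise ValueError("surrogateescape only remaps bytes 0x80-0xFF")
    return chr(0xDC00 + byte_value)


# 0xC3 starts a two-byte UTF-8 sequence, but 'N' is not a continuation byte.
raw = b'C\xc3N'
text = raw.decode('utf-8', 'surrogateescape')

# The undecodable byte 0xC3 is stored as U+DCC3.
assert text[1] == escape_byte(0xC3) == '\udcc3'
# Encoding with the same handler restores the original bytes.
assert text.encode('utf-8', 'surrogateescape') == raw
```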

For example, here's a non-UTF-8 byte string decoded with surrogateescape:

    >>> b'C\xc3N'.decode('utf8','surrogateescape')
    'C\udcc3N'

It can be converted to and from Unicode without loss:

    >>> b'C\xc3N'.decode('utf8','surrogateescape').encode('utf8','surrogateescape')
    b'C\xc3N'

But it can't be printed:

    >>> print(b'C\xc3N'.decode('utf8','surrogateescape'))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 1: surrogates not allowed
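If the goal is only to print such a name without a crash, one option is to escape the lone surrogates before printing. This is a minimal sketch, assuming the standard `backslashreplace` error handler is acceptable for display; `printable` is a hypothetical helper name, not something from the answer.

```python
def printable(name):
    """Return a copy of name with lone surrogates shown as escape sequences."""
    # backslashreplace substitutes a backslash escape for anything
    # the UTF-8 encoder cannot encode, so the result is plain text.
    return name.encode('utf-8', 'backslashreplace').decode('utf-8')


name = b'C\xc3N'.decode('utf-8', 'surrogateescape')
print(printable(name))
```

The printed text shows the escape for the bad byte literally, so it is useful for logs and debugging but is not the real file name.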

You'll have to decide what to do with file names that aren't valid in the default encoding. One option is to encode them back to the original bytes and then decode those bytes with the 'replace' error handler, which substitutes U+FFFD for each byte it cannot decode. Use the result for display only, and keep the original name to access the file.

    >>> b'C\xc3N'.decode('utf8','replace')
    'C�N'
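Keeping one name for display and another for access might look like the sketch below. It assumes the names come from os.walk() called with a str argument on a system whose filesystem encoding is UTF-8; `display_name` and `list_sizes` are hypothetical names chosen for this example.

```python
import os


def display_name(name):
    """Return a printable version of a name that may contain surrogate escapes."""
    raw = name.encode('utf-8', 'surrogateescape')  # back to the original bytes
    return raw.decode('utf-8', 'replace')          # undecodable bytes become U+FFFD


def list_sizes(top):
    """Yield (display path, size in bytes) for every file under top."""
    for root, dirs, files in os.walk(top):
        for name in files:
            # Keep the original, surrogate-escaped path for file access.
            path = os.path.join(root, name)
            yield display_name(path), os.lstat(path).st_size
```

Only the first element of each pair is meant for printing; the surrogate-escaped `path` is what gets passed back to the operating system.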

os.walk can also take a byte string and will return byte strings instead of Unicode strings:

    for p, d, f in os.walk(b'.'):
        print(p.decode('utf8', 'replace'))  # p is bytes; d and f are lists of bytes

Then you can decode as you like.
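A slightly fuller sketch of the bytes-based approach follows. The function name `walk_bytes` is hypothetical, and decoding with UTF-8 plus 'replace' is only one possible choice for display; `os.fsencode` is used so the starting directory can be given as an ordinary string.

```python
import os


def walk_bytes(top):
    """Yield (raw_path, display_path) for every file under top.

    raw_path is bytes and can be passed to open() or os.stat();
    display_path is a str intended only for printing.
    """
    for root, dirs, files in os.walk(os.fsencode(top)):
        for name in files:
            raw_path = os.path.join(root, name)  # bytes joined with bytes
            yield raw_path, raw_path.decode('utf-8', 'replace')
```

Looping over `walk_bytes('.')` and printing each `display_path` should not raise the surrogate error, because no surrogate-escaped string is ever created.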

answered Sep 22 '22 at 08:09 by Mark Tolonen


I ended up passing a byte string to os.walk(), which makes it return byte strings instead of Unicode strings containing surrogate escapes:

    for root, dirs, files in os.walk(b'.'):
        print(root)

answered Sep 23 '22 at 08:09 by Collin Anderson