I'm writing a script that has to move some files around, but unfortunately os.path doesn't seem to play well with internationalization. When I have files named in Hebrew, there are problems. Here's a screenshot of the contents of a directory:
(source: thegreenplace.net)
Now consider this code that goes over the files in this directory:
import os

# List the directory using a plain (8-bit) string path
files = os.listdir('test_source')
for f in files:
    pf = os.path.join('test_source', f)
    print pf, os.path.exists(pf)
The output is:
test_source\ex True
test_source\joe True
test_source\mie.txt True
test_source\__()'''.txt True
test_source\????.txt False
Notice how os.path.exists thinks that the Hebrew-named file doesn't even exist?
How can I fix this?
ActivePython 2.5.2 on Windows XP Home SP2
Hmm, after some digging it appears that when supplying os.listdir a unicode string, this kinda works:
# Passing a unicode path makes os.listdir return unicode filenames
files = os.listdir(u'test_source')
for f in files:
    pf = os.path.join(u'test_source', f)
    print pf.encode('ascii', 'replace'), os.path.exists(pf)
===>
test_source\ex True
test_source\joe True
test_source\mie.txt True
test_source\__()'''.txt True
test_source\????.txt True
Some important observations here:
os.listdir (and similar functions, like os.walk) should be passed a unicode string in order to work correctly with unicode paths. Here's a quote from the aforementioned link:
os.listdir(), which returns filenames, raises an issue: should it return the Unicode version of filenames, or should it return 8-bit strings containing the encoded versions? os.listdir() will do both, depending on whether you provided the directory path as an 8-bit string or a Unicode string. If you pass a Unicode string as the path, filenames will be decoded using the filesystem's encoding and a list of Unicode strings will be returned, while passing an 8-bit path will return the 8-bit versions of the filenames.
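As a minimal sketch of the behaviour the quote describes, the helper below lists one directory twice, once with a text path and once with a byte-string path, so the two return types can be compared side by side. It is written for Python 3, where the same split shows up as str versus bytes (in Python 2.5 it is unicode versus str). The name list_both_ways is hypothetical and not part of the os module.

```python
import os


def list_both_ways(directory):
    """Return (text_names, byte_names) for the same directory.

    Hypothetical helper, for illustration only.
    """
    # A text (unicode) path makes os.listdir return text filenames.
    text_names = sorted(os.listdir(os.fsdecode(directory)))
    # A byte-string path makes os.listdir return byte-string filenames.
    byte_names = sorted(os.listdir(os.fsencode(directory)))
    return text_names, byte_names
```

Calling list_both_ways('test_source') would return the same entries in both lists, differing only in type. That difference is what decides whether a non-ASCII name survives intact.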
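print could not output the unicode path as-is here, so the path is encoded to ASCII with 'replace' before printing; that is why the Hebrew name still shows up as ???? in the output.
It looks like a Unicode vs ASCII issue: os.listdir is returning a list of ASCII (8-bit) strings rather than unicode strings.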
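The encode(..., 'replace') step used for printing can be isolated in a small helper. This is only a sketch: printable is a hypothetical name, and the Hebrew filename below is made up for illustration, since the real one is not shown in the thread.

```python
def printable(path, encoding="ascii"):
    """Return a copy of path that is safe to print in the given encoding."""
    # Characters the encoding cannot represent are replaced with '?'.
    return path.encode(encoding, "replace").decode(encoding)


# Hypothetical four-letter Hebrew filename, written with escapes.
print(printable(u"test_source\\\u05e9\u05dc\u05d5\u05dd.txt"))
# prints: test_source\????.txt
```

The replacement only affects what is displayed. The original unicode path should still be the one passed to os.path.exists and similar functions.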
Edit: I tried it on Python 3.0, also on XP SP2, and os.listdir simply omitted the Hebrew filenames rather than listing them.
According to the docs, this means it was unable to decode them:
Note that when os.listdir() returns a list of strings, filenames that cannot be decoded properly are omitted rather than raising UnicodeError.