
UnicodeEncodeError on joining file name

The following code raises "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)":

import os

filename = 'Sp\xc2\x88ywaj.ttf'  # bytes as shown by the repr below
print repr(filename)
>> 'Sp\xc2\x88ywaj.ttf'
filepath = os.path.join('/dirname', filename)

But the file is valid and exists on disk. The filename was extracted from the output of the "unzip -l" command. How can I join filenames like this?

OS and filesystem

Filesystem: ext3    relatime,errors=remount-ro 0       0
Locale: en_US.UTF-8

With Alex's suggestion, os.path.join works now, but I still cannot access the file on disk using the joined filename.

filename = filename.decode('utf-8')
filepath = os.path.join('/dirname', filename)
print filepath
>> /dirname/u'Sp\xc2\x88ywaj.ttf'
print os.path.isfile(filepath)
>> False

new_filepath = filepath.encode('Latin-1').encode('utf-8')
print new_filepath
>> /dirname/u'Sp\xc2\x88ywaj.ttf'
print type(filepath)
>> <type 'unicode'>
print os.path.isfile(new_filepath)
>> False

import glob
valid_filepath = glob.glob('/dirname/*.ttf')[0]
print valid_filepath
>> /dirname/Spywaj.ttf (SO cannot display the chars in filename)
print type(valid_filepath)
>> <type 'str'>
print os.path.isfile(valid_filepath)
>> True
jack asked Jan 05 '10 04:01


3 Answers

In both Latin-1 (ISO-8859-1) and Windows-1252, 0xc2 would be a capital A with a circumflex accent, which doesn't seem to be anywhere in the code you show! Can you please add a

print repr(filename)

before the os.path.join call (and also put the '/dirname' in a variable and print its repr, for completeness)? I'm thinking that maybe that stray character is there but you're not seeing it for some reason; the repr will reveal it.

If you do have a Latin-1 (or Win-1252) non-ASCII character in your filename, you have to use Unicode and/or, depending on your OS and filesystem, some specific encoding thereof.
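A minimal sketch of why the bytes matter. It assumes the traceback comes from Python 2 implicitly decoding a byte-string filename as ASCII when it is combined with a unicode string; the question does not show which component was unicode, so that part is an assumption. The explicit decode below fails in the same way. The byte values are the ones from the repr in the question, and the snippet is written for Python 3 (the thread itself uses Python 2):

```python
# Byte string as reported by repr(filename) in the question.
raw_name = b'Sp\xc2\x88ywaj.ttf'

try:
    raw_name.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The failure points at the offset of the first non-ASCII byte.
print(error.start, repr(raw_name[error.start:error.start + 1]))
print(error)

# Decoding as UTF-8 succeeds: 12 bytes become 11 characters.
text_name = raw_name.decode('utf-8')
print(len(raw_name), len(text_name))
```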

Edit: the OP confirms, thanks to repr, that there are actually two bytes that can't possibly be ASCII: 0xc2 then 0x88, corresponding to what the OP thinks is one lowercase L. Well, in the justly popular UTF-8 encoding that sequence is a single Unicode code point, U+0088, which is a C1 control character rather than a printable letter (a capital A with circumflex, U+00C2, would be the bytes 0xc3 0x82 in UTF-8). How that could look like a lowercase L to the OP beggars explanation, but I imagine some fonts could be graphically crazy enough to afford such confusion.
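One way to check what the two bytes stand for under each encoding is to decode them and ask the standard unicodedata module about the result. This is only a sketch for Python 3; the variable names are illustrative:

```python
import unicodedata

pair = b'\xc2\x88'

# Read as UTF-8, the two bytes form a single code point.
as_utf8 = pair.decode('utf-8')
print(len(as_utf8), hex(ord(as_utf8)), unicodedata.category(as_utf8))

# Read as Latin-1, each byte is a character of its own.
as_latin1 = pair.decode('latin-1')
print(len(as_latin1), unicodedata.name(as_latin1[0]))
```

Read as UTF-8, the pair is one code point in the control-character category (Cc). The accented capital A only appears when 0xc2 is read on its own as Latin-1.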

So I would first try filename = filename.decode('utf-8'), which should allow the os.path.join to work. If open then balks at the resulting Unicode string (it might work, depending on the filesystem and OS), the next attempt is to try that Unicode object's .encode('Latin-1') or .encode('utf-8'), each on its own rather than chained. If none of the encodings work, information on the OS and filesystem in use, which the OP, I believe, hasn't given yet, becomes crucial.
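The trial-and-error approach above could be sketched roughly as follows. The helper names candidate_paths and first_existing are hypothetical, the list of encodings is just the two mentioned here, and the code is written for Python 3, where os.path functions accept byte paths:

```python
import os

def candidate_paths(dirname, raw_name, encodings=('utf-8', 'latin-1')):
    """Return (encoding, byte path) pairs, one per re-encoding of raw_name.

    raw_name is assumed to be UTF-8 bytes, as in the question.
    """
    text = raw_name.decode('utf-8')
    pairs = []
    for enc in encodings:
        try:
            encoded = text.encode(enc)
        except UnicodeEncodeError:
            continue  # this encoding cannot represent the name
        pairs.append((enc, os.path.join(dirname, encoded)))
    return pairs

def first_existing(dirname, raw_name):
    """Return the first (encoding, path) that exists on disk, or None."""
    for enc, path in candidate_paths(dirname, raw_name):
        if os.path.isfile(path):
            return enc, path
    return None

for enc, path in candidate_paths(b'/dirname', b'Sp\xc2\x88ywaj.ttf'):
    print(enc, repr(path))
```

Each encoding is applied separately to the decoded name. Chaining the two encode calls would re-encode an already encoded byte string, which is not what is intended.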

Alex Martelli answered Oct 07 '22 07:10


I have fixed the UnicodeDecodeError by adding these lines to /etc/apache2/envvars and restarting Apache.

export LANG='en_US.UTF-8'
export LC_ALL='en_US.UTF-8'

as described here: https://docs.djangoproject.com/en/dev/howto/deployment/wsgi/modwsgi/#if-you-get-a-unicodeencodeerror

I have spent some time debugging this.
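To see whether a process has actually picked up a UTF-8 locale, it can help to print what Python reports from inside that process (for example from the WSGI application itself). A small sketch for Python 3; the function name is hypothetical and the values depend entirely on the environment it runs in:

```python
import locale
import os
import sys

def describe_encoding_environment():
    """Collect the settings that influence how filenames are encoded."""
    return {
        'LANG': os.environ.get('LANG'),
        'LC_ALL': os.environ.get('LC_ALL'),
        'filesystem_encoding': sys.getfilesystemencoding(),
        'preferred_encoding': locale.getpreferredencoding(),
    }

for key, value in sorted(describe_encoding_environment().items()):
    print(key, value)
```

If LANG and LC_ALL come back as None and the filesystem encoding is reported as ASCII, the exports above are the likely fix.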

Don Grem answered Oct 07 '22 05:10


filename = filename.decode('utf-8').encode("latin-1")

works for me with the file from Splywaj.zip

>>> os.path.isfile(filename.decode("utf8").encode("latin-1"))
True
>>>
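At the byte level, this conversion does the following. The snippet is a Python 3 sketch using the bytes from the question's repr. The conclusion that the name is stored on disk with a single 0x88 byte is an inference from the isfile result above, not something confirmed elsewhere in the thread:

```python
# Bytes as shown by repr(filename) in the question.
listed_name = b'Sp\xc2\x88ywaj.ttf'

# Decode the UTF-8 pair to one code point, then store it as one Latin-1 byte.
disk_name = listed_name.decode('utf-8').encode('latin-1')
print(repr(disk_name))

# The only change is that the two bytes c2 88 became the single byte 88.
assert disk_name == listed_name.replace(b'\xc2\x88', b'\x88')
```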
YOU answered Oct 07 '22 05:10