Unicode encoding for filesystem in Mac OS X not correct in Python?

Tags:

Having a bit of struggle with Unicode file names in OS X and Python. I am trying to use filenames as input for a regular expression later in the code, but the encoding used in the filenames seem to be different from what sys.getfilesystemencoding() tells me. Take the following code:

#!/usr/bin/env python
# coding=utf-8

import sys,os
print sys.getfilesystemencoding()

p = u'/temp/s/'
s = u'åäö'
print 's', [ord(c) for c in s], s
s2 = s.encode(sys.getfilesystemencoding())
print 's2', [ord(c) for c in s2], s2
os.mkdir(p+s)
for d in os.listdir(p):
  print 'dir', [ord(c) for c in d], d

It outputs the following:

utf-8
s [229, 228, 246] åäö
s2 [195, 165, 195, 164, 195, 182] åäö
dir [97, 778, 97, 776, 111, 776] åäö

So, file system encoding is utf-8, but when I encode my filename åäö using that, it will not be the same as if I create a dir name with the same string. I expect that when I use my string åäö to create a dir, and read it's name back, it should use the same codes as if I applied the encoding directly.

If we look at the code points 97, 778, 97, 776, 111, 776, it's basically ASCII characters with added diacritic, e.g. o + ¨ = ö, which makes it two characters, not one. How can I avoid this discrepancy, is there an encoding scheme in Python that matches this behaviour by OS X, and why is not getfilesystemencoding() giving me the right result?

Or have I messed up?

796

asked Mar 18 '12 11:03

RipperDoc

2 Answers

MacOS X uses a special kind of decomposed UTF-8 to store filenames. If you need to e.g. read in filenames and write them to a "normal" UTF-8 file, you must normalize them :

filename = unicodedata.normalize('NFC', unicode(filename, 'utf-8')).encode('utf-8')

from here: https://web.archive.org/web/20120423075412/http://boodebr.org/main/python/all-about-python-and-unicode

197

answered Oct 06 '22 16:10

sigman

getfilesystemencoding() is giving you the correct response (the encoding), but it does not tell you the unicode normalisation form.

In particular, the HFS+ filesystem uses UTF-8 encoding, and a normalisation form close to "D" (which requires composed characters like ö to be decomposed into o¨). HFS+ is also tied to the normalisation form as it existed in Unicode version 3.2—as detailed in Apple's documentation for the HFS+ format.

Python's unicodedata.normalize method converts between forms, and if you prefix the call with the ucd_3_2_0 object, you can constrain it to Unicode version 3.2:

filename = unicodedata.ucd_3_2_0.normalize('NFC', unicode(filename, 'utf-8')).encode('utf-8')

answered Oct 06 '22 16:10

一二三

Related questions
                            
                                How can I tell if Python setuptools is installed?
                            
                                Django - How to use custom template tag with 'if' and 'else' checks? [duplicate]
                            
                                Right function for normalizing input of sklearn SVM
                            
                                Selenium: Element not clickable ... Other Element Would Receive Click
                            
                                Why is a list access O(1) in Python?
                            
                                Leveraging "Copy-on-Write" to Copy Data to Multiprocessing.Pool() Worker Processes
                            
                                How to align axis label to the right or top in matplotlib?
                            
                                How to parse tsv file with python?
                            
                                list comprehension for multiple return function?
                            
                                Associating string representations with an Enum that uses integer values?
                            
                                Pytest - how to skip tests unless you declare an option/flag?
                            
                                Correlation matrix plot with coefficients on one side, scatterplots on another, and distributions on diagonal
                            
                                Should Python unittests be in a separate module?
                            
                                What does "lambda" mean in Python, and what's the simplest way to use it?
                            
                                load python code at runtime
                            
                                python string format suppress/silent keyerror/indexerror [duplicate]
                            
                                Improving Performance of Django ForeignKey Fields in Admin
                            
                                Django admin display multiple fields on the same line
                            
                                Dynamic choices field in Django Models
                            
                                How can I include a python package with Hadoop streaming job?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Unicode encoding for filesystem in Mac OS X not correct in Python?

Tags:

python

file-io

filesystems

macos

unicode

RipperDoc

People also ask

2 Answers

sigman

一二三

Recent Activity

Donate For Us