I'm struggling a bit with Unicode file names on OS X in Python. I am trying to use filenames as input for a regular expression later in the code, but the encoding used in the filenames seems to be different from what sys.getfilesystemencoding() tells me. Take the following code:
#!/usr/bin/env python
# coding=utf-8
import sys,os
print sys.getfilesystemencoding()
p = u'/temp/s/'
s = u'åäö'
print 's', [ord(c) for c in s], s
s2 = s.encode(sys.getfilesystemencoding())
print 's2', [ord(c) for c in s2], s2
os.mkdir(p+s)
for d in os.listdir(p):
    print 'dir', [ord(c) for c in d], d
It outputs the following:
utf-8
s [229, 228, 246] åäö
s2 [195, 165, 195, 164, 195, 182] åäö
dir [97, 778, 97, 776, 111, 776] åäö
So, the file system encoding is utf-8, but when I encode my filename åäö using that, the result is not the same as what I get back after creating a directory with the same string. I expected that when I use my string åäö to create a directory and read its name back, it would use the same codes as if I had applied the encoding directly.
If we look at the code points 97, 778, 97, 776, 111, 776, they are basically ASCII characters with added combining diacritics, e.g. o + ¨ = ö, which makes each letter two code points, not one. How can I avoid this discrepancy? Is there an encoding scheme in Python that matches this behaviour of OS X, and why is getfilesystemencoding() not giving me the right result?
Or have I messed up?
Mac OS Roman is an extension of the original Macintosh character set, which encoded only 217 characters. Full support for Mac OS Roman first appeared in System 6.0.4, released in 1989, and the encoding is still supported in current versions of macOS, though the standard character encodings are now UTF-8 or UTF-16.
In Python, the built-in functions chr() and ord() are used to convert between Unicode code points and characters. A character can also be represented by writing a hexadecimal Unicode code point with \x, \u, or \U in a string literal.
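As a quick illustration of those functions and escapes, here is a minimal sketch assuming Python 3 (where chr() accepts any code point); the variable names are only for illustration:

```python
# ord() maps a character to its code point; chr() does the reverse.
a_ring = chr(229)                      # U+00E5, the composed 'å'
assert ord(a_ring) == 0xE5

# The same character written with the three escape forms.
assert '\xe5' == '\u00e5' == '\U000000e5' == a_ring

# A decomposed spelling: 'a' followed by COMBINING RING ABOVE (U+030A).
decomposed = 'a\u030a'
code_points = [ord(c) for c in decomposed]
```

The two code points in the decomposed spelling are the first pair that shows up in the os.listdir() output from the question.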
Python's string type uses the Unicode Standard for representing characters, which lets Python programs work with all these different possible characters.
Mac OS X uses a special kind of decomposed UTF-8 to store filenames. If you need to, e.g., read in filenames and write them to a "normal" UTF-8 file, you must normalize them:
import unicodedata
filename = unicodedata.normalize('NFC', unicode(filename, 'utf-8')).encode('utf-8')
from here: https://web.archive.org/web/20120423075412/http://boodebr.org/main/python/all-about-python-and-unicode
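For what it's worth, on Python 3 os.listdir() already returns str when given a str path, so the decode/encode round trip in the one-liner above should not be needed there. A minimal sketch, assuming Python 3; to_nfc is a hypothetical helper name, not something from the linked article:

```python
import unicodedata

def to_nfc(name):
    """Return name with combining sequences composed (NFC)."""
    return unicodedata.normalize('NFC', name)

# The decomposed name from the question's os.listdir() output:
# a + U+030A, a + U+0308, o + U+0308.
listed = 'a\u030aa\u0308o\u0308'
composed = to_nfc(listed)
composed_points = [ord(c) for c in composed]
```

After normalizing, the code points are the same ones the question printed for the original string s.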
getfilesystemencoding() is giving you the correct response (the encoding), but it does not tell you the Unicode normalisation form.
In particular, the HFS+ filesystem uses UTF-8 encoding and a normalisation form close to "D" (which requires composed characters like ö to be decomposed into o¨). HFS+ is also tied to the normalisation form as it existed in Unicode version 3.2, as detailed in Apple's documentation for the HFS+ format.
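To make that concrete, here is a small sketch (Python 3 assumed, and using plain NFD as a stand-in for the HFS+ form, which is only close to it) showing that both spellings of ö are valid UTF-8 and differ only in normalisation:

```python
import unicodedata

composed = '\u00f6'                                  # ö as one code point
decomposed = unicodedata.normalize('NFD', composed)  # o + U+0308

decomposed_points = [ord(c) for c in decomposed]
composed_bytes = list(composed.encode('utf-8'))
decomposed_bytes = list(decomposed.encode('utf-8'))

# The strings compare unequal until both are brought to a common form.
equal_raw = composed == decomposed
equal_nfc = composed == unicodedata.normalize('NFC', decomposed)
```

The composed bytes are the 195, 182 pair from the question's s2 output; the decomposed bytes are a plain ASCII o followed by the two-byte encoding of the combining diaeresis.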
Python's unicodedata.normalize method converts between forms, and if you prefix the call with the ucd_3_2_0 object, you can constrain it to Unicode version 3.2:
filename = unicodedata.ucd_3_2_0.normalize('NFC', unicode(filename, 'utf-8')).encode('utf-8')
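Since the original goal was to feed the names to a regular expression, that step might look roughly like the following in Python 3. This is only a sketch: the helper name nfc_3_2 and the pattern are made up for illustration, and the sample name is hard-coded rather than read from disk.

```python
import re
import unicodedata

def nfc_3_2(name):
    """Compose name using the Unicode 3.2 tables (hypothetical helper)."""
    return unicodedata.ucd_3_2_0.normalize('NFC', name)

# Hypothetical pattern written with composed characters (å, ä, ö).
pattern = re.compile('[\u00e5\u00e4\u00f6]+')

listed = 'a\u030aa\u0308o\u0308'   # decomposed, as os.listdir() returned it
raw_match = pattern.fullmatch(listed)
normalized_match = pattern.fullmatch(nfc_3_2(listed))
```

The unnormalized name does not match because it starts with a plain a, while the normalized one does. In real code you would apply the helper to each entry, e.g. [nfc_3_2(n) for n in os.listdir(path)], before matching.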