
Unicode encoding for filesystem in Mac OS X not correct in Python?

I'm struggling a bit with Unicode file names in OS X and Python. I am trying to use filenames as input for a regular expression later in the code, but the encoding used in the filenames seems to be different from what sys.getfilesystemencoding() tells me. Take the following code:

#!/usr/bin/env python
# coding=utf-8

import sys,os
print sys.getfilesystemencoding()

p = u'/temp/s/'
s = u'åäö'
print 's', [ord(c) for c in s], s
s2 = s.encode(sys.getfilesystemencoding())
print 's2', [ord(c) for c in s2], s2
os.mkdir(p+s)
for d in os.listdir(p):
  print 'dir', [ord(c) for c in d], d

It outputs the following:

utf-8
s [229, 228, 246] åäö
s2 [195, 165, 195, 164, 195, 182] åäö
dir [97, 778, 97, 776, 111, 776] åäö

So the file system encoding is utf-8, but when I encode my filename åäö with it, the result is not the same as the name I get back after creating a directory from the same string. I expected that when I use my string åäö to create a directory and read its name back, it would use the same codes as if I had applied the encoding directly.

If we look at the code points 97, 778, 97, 776, 111, 776, they are basically ASCII characters followed by combining diacritics, e.g. o + ¨ = ö, which makes two characters per letter, not one. How can I avoid this discrepancy? Is there an encoding scheme in Python that matches this OS X behaviour, and why is getfilesystemencoding() not giving me the right result?

Or have I messed up?

RipperDoc asked Mar 18 '12 11:03



2 Answers

Mac OS X stores filenames as a decomposed form of UTF-8. If you need to, for example, read in filenames and write them to a "normal" (composed) UTF-8 file, you must normalize them:

filename = unicodedata.normalize('NFC', unicode(filename, 'utf-8')).encode('utf-8')

Source: https://web.archive.org/web/20120423075412/http://boodebr.org/main/python/all-about-python-and-unicode
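The one-liner above is Python 2. For readers on Python 3, where os.listdir() already returns str, a minimal sketch of the same idea might look like the following. The helper name nfc_listdir is made up for illustration and is not part of any library.

```python
import os
import unicodedata


def nfc_listdir(path):
    """Return the entries of `path` normalized to NFC (composed) form.

    Hypothetical helper for illustration; in Python 3 os.listdir()
    returns str, so no explicit decode step is needed.
    """
    return [unicodedata.normalize('NFC', name) for name in os.listdir(path)]


# A decomposed name, written with escapes so the form is unambiguous:
# a + U+030A, a + U+0308, o + U+0308 (the code points from the question).
decomposed = 'a\u030aa\u0308o\u0308'
composed = unicodedata.normalize('NFC', decomposed)
print([ord(c) for c in composed])  # [229, 228, 246]
```

Normalizing once, right where the names enter the program, means later comparisons and regular expressions only ever see one form.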

sigman answered Oct 06 '22 16:10


getfilesystemencoding() is giving you the correct response (the encoding), but it does not tell you the Unicode normalisation form.

In particular, the HFS+ filesystem uses UTF-8 encoding and a normalisation form close to "D" (which requires composed characters like ö to be decomposed into o followed by the combining diaeresis, U+0308). HFS+ is also tied to the normalisation form as it existed in Unicode version 3.2, as detailed in Apple's documentation for the HFS+ format.
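To make the difference between the forms concrete, here is a small Python 3 sketch (the question's code is Python 2) using standard NFD, which for these three letters reproduces the code points from the question's directory listing. The strings are written with escapes so that the editor's own normalisation cannot interfere.

```python
import unicodedata

# Composed form: one code point per letter.
s = '\u00e5\u00e4\u00f6'

# Decomposed form: base letter followed by a combining mark.
nfd = unicodedata.normalize('NFD', s)

print([ord(c) for c in s])    # [229, 228, 246]
print([ord(c) for c in nfd])  # [97, 778, 97, 776, 111, 776]

# The two strings render the same but do not compare equal
# until both sides are brought to the same normalisation form.
print(s == nfd)                                # False
print(unicodedata.normalize('NFC', nfd) == s)  # True
```

The same applies to regular expressions: a pattern written in composed form will not match a decomposed filename unless one of the two is normalised first.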

Python's unicodedata.normalize function converts between forms, and if you call it on the ucd_3_2_0 object instead of the module, you can constrain it to Unicode version 3.2:

filename = unicodedata.ucd_3_2_0.normalize('NFC', unicode(filename, 'utf-8')).encode('utf-8')
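The line above is Python 2. As a rough Python 3 equivalent, assuming the filename is already a str (as returned by os.listdir()), the same object can be used like this. The sample name is just the decomposed string from the question.

```python
import unicodedata

# Same interface as the unicodedata module, but backed by the
# Unicode 3.2 database.
ucd32 = unicodedata.ucd_3_2_0
print(ucd32.unidata_version)  # 3.2.0

# Decomposed sample name (a + U+030A, a + U+0308, o + U+0308).
name = 'a\u030aa\u0308o\u0308'
composed = ucd32.normalize('NFC', name)
print([ord(c) for c in composed])  # [229, 228, 246]
```

For these letters the result is the same as with the current Unicode database; the version pin only matters for characters whose normalisation data changed after Unicode 3.2.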
一二三 answered Oct 06 '22 16:10