
Decoding Mac OS text in Python

Tags: python, macos

I'm writing some code to parse RTF documents, and need to handle the various codepages they can use. Python comes with decoders for all the necessary Windows codepages, but I'm not sure how to handle the Mac ones:

# 77: "10000", # Mac Roman
# 78: "10001", # Mac Shift Jis
# 79: "10003", # Mac Hangul
# 80: "10008", # Mac GB2312
# 81: "10002", # Mac Big5
# 83: "10005", # Mac Hebrew
# 84: "10004", # Mac Arabic
# 85: "10006", # Mac Greek
# 86: "10081", # Mac Turkish
# 87: "10021", # Mac Thai
# 88: "10029", # Mac East Europe
# 89: "10007", # Mac Russian

Does Python have any built-in support for these? If not, is there a cross-platform pure-Python library that will handle them?

Brendon asked Oct 20 '09 07:10


2 Answers

Python ships codecs for several of these. You refer to them by name: 'mac-roman', 'mac-turkish', etc.

>>> 'foo'.decode('mac-turkish')
u'foo'

You'll have to refer to them by name; the code page numbers in your question don't appear in the source files. For more information, look at $pylib/encodings/mac_*.py.
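Since the codecs are only reachable by name, an RTF parser needs its own table from code page number to codec name. The following is a minimal sketch in Python 3 syntax (where decoding starts from a bytes object). The table MAC_CODEPAGE_TO_CODEC and the helper decode_mac_bytes are hypothetical names, and the number-to-name pairings are assumptions based on the labels in the question rather than anything defined by Python. Only code pages with an obvious single-byte counterpart are listed; the rest are deliberately left out.

```python
import codecs

# Hypothetical table: code page numbers from the question mapped to
# Python codec names. The pairing is an assumption, not part of Python.
MAC_CODEPAGE_TO_CODEC = {
    "10000": "mac_roman",     # Mac Roman
    "10006": "mac_greek",     # Mac Greek
    "10007": "mac_cyrillic",  # Mac Russian
    "10029": "mac_latin2",    # Mac East Europe
    "10081": "mac_turkish",   # Mac Turkish
}


def decode_mac_bytes(data, codepage):
    """Decode bytes using the codec assumed for the given code page number."""
    name = MAC_CODEPAGE_TO_CODEC.get(codepage)
    if name is None:
        raise LookupError("no codec mapped for code page %s" % codepage)
    # codecs.lookup raises LookupError if this Python lacks the codec.
    codecs.lookup(name)
    return data.decode(name)


print(decode_mac_bytes(b"foo", "10081"))
```

Code pages missing from the table raise LookupError, so the caller can decide whether to fall back to another encoding or report the document as unsupported.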

Jerub answered Oct 11 '22 21:10


It seems that at least the Mac Roman and Mac Turkish encodings exist in the Python standard library, under the names macroman and macturkish. See http://svn.python.org/projects/python/trunk/Lib/encodings/aliases.py for a complete list of encoding aliases in the most up-to-date Python.
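The same alias table can be inspected from the interpreter rather than read from the repository. This is a small sketch that assumes the installed Python exposes the table as encodings.aliases.aliases, a dict from alias to codec module name. The variable mac_aliases is a hypothetical name, and the output depends on the Python version.

```python
from encodings.aliases import aliases

# Collect every alias whose target codec module starts with "mac_".
mac_aliases = sorted(
    (alias, target)
    for alias, target in aliases.items()
    if target.startswith("mac_")
)

for alias, target in mac_aliases:
    print(alias, "->", target)
```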

Tuure Laurinolli answered Oct 11 '22 21:10