
Does python support unicode beyond basic multilingual plane?

Below is a simple test. repr works fine, yet len and iteration ([x for x in ...]) do not split the Unicode text into characters correctly in Python 2.6 and 2.7:

In [1]: u"爨爵"
Out[1]: u'\U0002f920\U0002f921'

In [2]: [x for x in u"爨爵"]
Out[2]: [u'\ud87e', u'\udd20', u'\ud87e', u'\udd21']

The good news is that Python 3.3 does the right thing ™.

Is there any hope for the Python 2.x series?

Dima Tisnek asked Oct 15 '13 18:10


1 Answer

Yes, provided you compiled your Python with wide-unicode support.

By default, Python 2 is built with narrow (UCS2) Unicode support only. To enable wide support, configure the build with:

./configure --enable-unicode=ucs4

You can verify what configuration was used by testing sys.maxunicode:

import sys
if sys.maxunicode == 0x10FFFF:
    print 'Python built with UCS4 (wide unicode) support'
else:
    print 'Python built with UCS2 (narrow unicode) support'

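A wide build stores every unicode value as UCS4, using 4 bytes per character instead of the 2 bytes of a narrow build, which doubles memory usage for these values. Python 3.3 switched to a variable-width representation: each string uses 1, 2, or 4 bytes per character, whichever is enough to represent the widest character in that string.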
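
The variable-width storage can be observed directly. This is a minimal sketch, assuming CPython 3.3 or later (sys.getsizeof reports implementation-specific sizes, and the helper name bytes_per_char is made up for illustration):

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate storage per character by growing a string of `ch` by one."""
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


ascii_cost = bytes_per_char(u"a")            # fits in one byte
bmp_cost = bytes_per_char(u"\u20ac")         # inside the BMP, needs two bytes
astral_cost = bytes_per_char(u"\U0002f920")  # beyond the BMP, needs four bytes

print(ascii_cost, bmp_cost, astral_cost)
```

On CPython 3.3+ the three values should come out as 1, 2 and 4. Measuring the difference between two lengths cancels out the fixed per-object header, so only the per-character cost remains.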

Quick demo showing that a wide build handles your sample Unicode string correctly:

$ python2.6
Python 2.6.6 (r266:84292, Dec 27 2010, 00:02:40) 
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
1114111
>>> [x for x in u'\U0002f920\U0002f921']
[u'\U0002f920', u'\U0002f921']
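
If rebuilding Python is not an option, one possible workaround on a narrow build is to join surrogate pairs by hand while iterating. The following is only a sketch, not part of the standard library: iter_code_points and pair_to_code_point are hypothetical helper names, and the sample is written as explicit surrogates so that it behaves the same on any build:

```python
def iter_code_points(text):
    """Yield one string per code point, joining UTF-16 surrogate pairs."""
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        # A high surrogate followed by a low surrogate encodes one code point.
        if (u"\ud800" <= ch <= u"\udbff" and i + 1 < n
                and u"\udc00" <= text[i + 1] <= u"\udfff"):
            yield text[i:i + 2]
            i += 2
        else:
            yield ch
            i += 1


def pair_to_code_point(pair):
    """Decode a two-unit surrogate pair into its integer code point."""
    high, low = ord(pair[0]), ord(pair[1])
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The question's string, as a narrow build stores it (four code units).
sample = u"\ud87e\udd20\ud87e\udd21"
chars = list(iter_code_points(sample))
print(len(chars))                                   # 2
print([hex(pair_to_code_point(c)) for c in chars])  # ['0x2f920', '0x2f921']
```

Unpaired surrogates are passed through unchanged, so the helper never raises on malformed input. len and slicing on the original string still count code units, which is why a wide build remains the cleaner fix.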
Martijn Pieters answered Oct 20 '22 00:10