Below is a simple test. repr
seems to work fine. yet len
and x for x in
doesn't seem to divide the unicode text correctly in Python 2.6 and 2.7:
In [1]: u"爨爵"
Out[1]: u'\U0002f920\U0002f921'
In [2]: [x for x in u"爨爵"]
Out[2]: [u'\ud87e', u'\udd20', u'\ud87e', u'\udd21']
Good news is Python 3.3 does the right thing ™.
Is there any hope for Python 2.x series?
Yes, provided you compiled your Python with wide-unicode support.
By default, Python is built with narrow unicode support only. Enable wide support with:
./configure --enable-unicode=ucs4
You can verify what configuration was used by testing sys.maxunicode
:
import sys
if sys.maxunicode == 0x10FFFF:
print 'Python built with UCS4 (wide unicode) support'
else:
print 'Python built with UCS2 (narrow unicode) support'
A wide build will use UCS4 characters for all unicode values, doubling memory usage for these. Python 3.3 switched to variable width values; only enough bytes are used to represent all characters in the current value.
Quick demo showing that a wide build handles your sample Unicode string correctly:
$ python2.6
Python 2.6.6 (r266:84292, Dec 27 2010, 00:02:40)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
1114111
>>> [x for x in u'\U0002f920\U0002f921']
[u'\U0002f920', u'\U0002f921']
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With