Here are two Python interpreter sessions. The first is from Python on CentOS; the second is from the built-in Python on Mac OS X 10.7. Why does the second session create strings of length two from the \U escape sequence, and then raise an error in ord()?
$ python
Python 2.6.6 (r266:84292, Dec 7 2011, 20:48:22)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'\U00000020'
u' '
>>> u'\U00000065'
u'e'
>>> u'\U0000FFFF'
u'\uffff'
>>> u'\U00010000'
u'\U00010000'
>>> len(u'\U00010000')
1
>>> ord(u'\U00010000')
65536
$ python
Python 2.6.7 (r267:88850, Jul 31 2011, 19:30:54)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
>>> u'\U00000020'
u' '
>>> u'\U00000065'
u'e'
>>> u'\U0000FFFF'
u'\uffff'
>>> u'\U00010000'
u'\U00010000'
>>> len(u'\U00010000')
2
>>> ord(u'\U00010000')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found
I'm not at all sure about this, but it may be that your Mac OS X system uses a "narrow build" of Python, which stores unicode strings internally as 16-bit code units and represents code points at or above 2**16 as a pair of code units (a surrogate pair). That would explain len(u'\U00010000') == 2, and why ord() complains about a string of length 2.
Try unichr(0x10000) on OS X and see if you get an error referring to narrow builds. See also "What encoding do normal python strings use?", in particular IVH's answer.
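If you want to check this without triggering an error, here is a minimal sketch. It relies on sys.maxunicode, which is 0xFFFF on narrow builds and 0x10FFFF on wide builds (and on every Python from 3.3 onward). The helper names is_narrow_build, to_surrogate_pair and from_surrogate_pair are made up for illustration; they just spell out the standard UTF-16 surrogate arithmetic that a narrow build would be applying to \U00010000.

```python
import sys


def is_narrow_build():
    # Hypothetical helper: narrow builds cap sys.maxunicode at 0xFFFF,
    # wide builds (and Python 3.3+) report 0x10FFFF.
    return sys.maxunicode == 0xFFFF


def to_surrogate_pair(code_point):
    # Hypothetical helper: split a code point above U+FFFF into the
    # (high, low) UTF-16 surrogates a narrow build stores for it.
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point must be in U+10000..U+10FFFF")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def from_surrogate_pair(high, low):
    # Hypothetical helper: inverse of to_surrogate_pair, i.e. what ord()
    # would need to do to handle a length-2 string on a narrow build.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    print("narrow build: %s" % is_narrow_build())
    high, low = to_surrogate_pair(0x10000)
    print("U+10000 -> %04X %04X" % (high, low))
```

On a narrow build the two code units in u'\U00010000' should be the values this returns for 0x10000, so u'\U00010000'[0] would be u'\ud800' and u'\U00010000'[1] would be u'\udc00'.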
It's possible to recompile Python as a wide build (for Python 2, by passing --enable-unicode=ucs4 to configure) even if the default Python on your system is a narrow build.