Since Python 2.2 and PEP 261, Python can be built in "narrow" or "wide" mode, which affects the definition of a "character", i.e. "the addressable unit of a Python Unicode string".
Characters in narrow builds look like UTF-16 code units:
>>> a = u'\N{MAHJONG TILE GREEN DRAGON}'
>>> a
u'\U0001f005'
>>> len(a)
2
>>> a[0], a[1]
(u'\ud83c', u'\udc05')
>>> [hex(ord(c)) for c in a.encode('utf-16be')]
['0xd8', '0x3c', '0xdc', '0x5']
(The above seems to contradict some sources, which insist that narrow builds use UCS-2 rather than UTF-16. Very intriguing indeed.)
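For reference, the two units printed above match the UTF-16 surrogate pair for U+1F005. Here is a minimal sketch of that arithmetic in Python 3; `to_surrogate_pair` is a hypothetical helper name used only for illustration, not a standard library function:

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000       # 20-bit value left after removing the BMP range
    high = 0xD800 + (offset >> 10)      # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F005)
print(hex(high), hex(low))  # 0xd83c 0xdc05
```

The result agrees with `a[0]` and `a[1]` in the session above, which is why narrow-build strings look like UTF-16 code units when indexed.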
Does Python 3.0 keep this distinction? Or are all Python 3 builds wide?
(I've heard about PEP 393, which changes the internal representation of strings in 3.3, but that doesn't apply to 3.0 through 3.2.)
Yes, Python 3.0 through 3.2 keep the narrow/wide distinction. Windows builds are narrow, while most Unix builds are wide.
Using Python 3.2 on Windows:
>>> a = '\N{MAHJONG TILE GREEN DRAGON}'
>>> len(a)
2
>>> a
'🀅'
On 3.3+ on Windows, by contrast, this is the expected behavior:
>>> a = '\N{MAHJONG TILE GREEN DRAGON}'
>>> len(a)
1
>>> a
'\U0001f005'
>>> print(a)
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    print(a)
UnicodeEncodeError: 'UCS-2' codec can't encode character '\U0001f005' in position 0: Non-BMP character not supported in Tk
The UCS-2 codec in that error is used by Tk (I'm using IDLE; a terminal may show a different error).
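To check the build type directly rather than inferring it from `len()`, one option is `sys.maxunicode`: it is 0xFFFF on narrow builds and 0x10FFFF on wide builds, and always 0x10FFFF from 3.3 onward. A small sketch, where `build_kind` is a hypothetical name chosen for illustration:

```python
import sys


def build_kind(max_unicode=sys.maxunicode):
    """Classify a build by the largest code point one string element can hold."""
    # 0xFFFF means each element is a 16-bit unit (narrow build, 3.2 and earlier).
    # 0x10FFFF means full code points (wide build, or any build on 3.3+).
    return "narrow" if max_unicode == 0xFFFF else "wide"


print(build_kind())
```

On 3.3 and later this should always report "wide", since PEP 393 removed the build-time choice.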