
Do Unicode strings in Python 3 still depend on "narrow" / "wide" builds?

Since Python 2.2 and PEP 261, Python can be built in "narrow" or "wide" mode, which affects the definition of a "character", i.e. "the addressable unit of a Python Unicode string".

Characters in narrow builds look like UTF-16 code units:

>>> a = u'\N{MAHJONG TILE GREEN DRAGON}'
>>> a
u'\U0001f005'
>>> len(a)
2
>>> a[0], a[1]
(u'\ud83c', u'\udc05')
>>> [hex(ord(c)) for c in a.encode('utf-16be')]
['0xd8', '0x3c', '0xdc', '0x5']

(The above seems to disagree with some sources that insist that narrow builds use UCS-2, not UTF-16: the string is stored as a surrogate pair, which UCS-2 does not define. Very intriguing indeed.)
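
For reference, the two code units shown above follow directly from the standard UTF-16 surrogate arithmetic. The snippet below is only an illustrative sketch; the helper name `to_surrogate_pair` is made up for this example and is not part of Python.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000          # 20-bit value to split in two
    high = 0xD800 + (offset >> 10)         # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> low surrogate
    return high, low

high, low = to_surrogate_pair(0x1F005)     # MAHJONG TILE GREEN DRAGON
print(hex(high), hex(low))                 # 0xd83c 0xdc05
```

These are the same values as `a[0]` and `a[1]` in the narrow-build session.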

Does Python 3.0 keep this distinction? Or are all Python 3 builds wide?

(I've heard that PEP 393 changes the internal representation of strings in 3.3, but that doesn't apply to 3.0 through 3.2.)

Kos asked Feb 09 '13 19:02


1 Answer

Yes, Python 3.0 through 3.2 keep the distinction. Windows uses narrow builds, while (most) Unix systems use wide builds.
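
If you are not sure which kind of build you have, `sys.maxunicode` tells you: it is 0xFFFF on a narrow build and 0x10FFFF on a wide build (and on every 3.3+ build). A minimal sketch, where `build_kind` is just a name chosen for this example:

```python
import sys

def build_kind():
    """Return 'narrow' or 'wide' based on the largest code point a str can hold."""
    # Narrow builds report 0xFFFF; wide builds and all 3.3+ builds report 0x10FFFF.
    return "narrow" if sys.maxunicode == 0xFFFF else "wide"

print(build_kind(), hex(sys.maxunicode))
```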

Using Python 3.2 on Windows:

>>> a = '\N{MAHJONG TILE GREEN DRAGON}'
>>> len(a)
2
>>> a
'🀅'

On Python 3.3+, by contrast, this is the expected behavior on Windows:

>>> a = '\N{MAHJONG TILE GREEN DRAGON}'
>>> len(a)
1
>>> a
'\U0001f005'
>>> print(a)
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    print(a)
UnicodeEncodeError: 'UCS-2' codec can't encode character '\U0001f005' in position 0: Non-BMP character not supported in Tk

The error comes from Tk, which uses the UCS-2 codec and cannot display non-BMP characters (I'm using IDLE; a terminal may show a different error).
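
If you still need the old narrow-build length on 3.3+ (for example, when talking to an API that counts UTF-16 code units), you can compute it from the encoded form. This is only a sketch; `utf16_code_units` is a hypothetical helper name, not something from the standard library.

```python
def utf16_code_units(text):
    """Count the UTF-16 code units needed to represent text."""
    # 'utf-16-le' emits no byte order mark, so every code unit is exactly 2 bytes.
    return len(text.encode("utf-16-le")) // 2

tile = "\N{MAHJONG TILE GREEN DRAGON}"
print(len(tile))               # 1 on Python 3.3+ (counts code points)
print(utf16_code_units(tile))  # 2, matching len() on a narrow build
```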

JBernardo answered Oct 16 '22 23:10