Python 2.7.6 splits single "high" unicode code point in two

As a workaround for MySQL truncating unicode strings when encountering "high" (ordinal >= 2^16) code points, I've been using a little Python method that steps through the string (strings are sequences, remember), does ord() on the character, and preempts the truncation, either by substituting something else, or removing the code point outright. This has been working as expected on many machines with Python 2.7.3 (Ubuntu 12.04 LTS, some Centos 6, mixed 32 bit and 64 bit CPUs, hasn't mattered so far).
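For reference, the filter presumably looks something like this (a hypothetical reconstruction; the actual helper isn't shown in the question):

```python
def strip_astral(text, replacement=u''):
    # Drop (or substitute) any code point >= 2**16, which MySQL's
    # 3-byte "utf8" charset cannot store, before the string reaches
    # the database. Assumes iterating the string yields one full
    # code point per step.
    return u''.join(ch if ord(ch) < 0x10000 else replacement
                    for ch in text)
```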

I've noticed that on a Python 2.7.6 install, this breaks. ASCII chars and "low" codepoints (ordinal < 2^16) behave as before, but the high codepoints (>= 2^16) behave really strangely: Python 2.7.6 appears to treat each of them as two codepoints. Here's a test case boiled down to the basics:

### "good" machine, Python2.7.3
$ uname -a && echo $LANG
Linux *** 3.2.0-60-virtual #91-Ubuntu SMP Wed Feb 19 04:13:28 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
en_US.UTF-8
$ python2.7
Python 2.7.3 (default, Feb 27 2014, 19:58:35) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> utest = u'a\u0395\U0001f30e'    # three chars: ascii, "low" codepoint, "high" codepoint
>>> utest.__class__
<type 'unicode'>
>>> len(utest), hash(utest)
(3, 1453079728409075183)
>>> list(utest)        # split into list of single chars
[u'a', u'\u0395', u'\U0001f30e']
>>> utest[2]   # trying to extract third char (high codepoint)
u'\U0001f30e'
>>> len(utest[2])
1
>>> "%x" % ord(utest[2])
'1f30e'

This is the expected behaviour. I initialize a unicode string with three chars; Python agrees it has three characters and can "address" the third one fine, returning the single expected high codepoint. Taking the ordinal of that codepoint gives the same number as in the original escape sequence.

Now comes Python 2.7.6

### "bad" machine, Python 2.7.6
$ uname -a && echo $LANG
Linux *** 2.6.32-431.5.1.el6.x86_64 #1 SMP Wed Feb 12 00:41:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
en_US.UTF-8
$ python2.7
Python 2.7.6 (default, Jan 29 2014, 20:05:36)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> utest = u'a\u0395\U0001f30e'
>>> utest.__class__
<type 'unicode'>
>>> len(utest), hash(utest)    # !!!
(4, -2836525916470507760)

First discrepancy: Python 2.7.6 says utest has length 4. The hash is also different. Next surprise:

>>> list(utest)                # !!!
[u'a', u'\u0395', u'\ud83c', u'\udf0e']

Not only is the length off, the splitting into single chars is even weirder: the two "halves" of the high codepoint get turned into two low codepoints with no obvious numeric relation -- at least to me -- to the original codepoint.

Addressing that codepoint by sequence index exhibits the same breakage:

>>> utest[2]
u'\ud83c'

To get the original high codepoint, I now have to use a two-character slice:

>>> utest[2:4]
u'\U0001f30e'

But, as is probably obvious by now, Python 2.7.6 still internally treats this as two codepoints, so I have no way of getting a single ordinal from it:

>>> len(utest[2:4])
2
>>> "%x" % ord(utest[2:4])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found

So, what to do? The code I have depends on ordinals of codepoints within a unicode string. If a codepoint is sometimes really two codepoints, my ordinals become meaningless, and my code can't perform its function.

Is there a rationale for this behaviour? Is it an intentional change? Is there some config knob I can turn to restore the old behaviour, inside Python, or on the system level? A monkey patch? I don't know where to look.

I can't even narrow it down to the exact minor release unfortunately. We have a lot of 2.7.3, a few 2.7.1, and a couple 2.7.6 installations. No 2.7.4 / 2.7.5. All I can say is that I've never had this problem on any 2.7.3 install.

Bonus info: encoding the string to UTF-8 yields identical results on both Python versions (same bytes, same length, same hash). Decoding that UTF-8 again puts me right back at square one, i.e. it's not a workaround; the behaviour still diverges in unicode space.
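That byte-level agreement is easy to sanity-check (my sketch, not from the question): UTF-8 encoding is independent of the interpreter's internal string representation, so both builds produce the same byte string.

```python
utest = u'a\u0395\U0001f30e'
encoded = utest.encode('utf-8')
# UTF-8 widths: 'a' is 1 byte, U+0395 is 2, U+1F30E is 4 -- 7 bytes
# total, identical on narrow and wide builds.
assert len(encoded) == 7
```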

Asked by Rolf NB, Oct 21 '22

1 Answer

You are experiencing what is known as "surrogate pairs". These only happen on narrow builds of Python, where strings are stored internally as UTF-16. You can confirm which build you have by checking sys.maxunicode (on a narrow build it will be 2**16 - 1).
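To make the pair less mysterious (a sketch of the standard UTF-16 arithmetic, not part of the original answer): the two "halves" are derived from the original codepoint by fixed bit operations.

```python
import sys

# Narrow build: sys.maxunicode == 0xFFFF; wide build: 0x10FFFF.
NARROW = sys.maxunicode == 0xFFFF

def to_surrogate_pair(cp):
    # Decompose a supplementary-plane code point (>= 0x10000) into
    # the UTF-16 surrogate pair a narrow build stores internally.
    v = cp - 0x10000                       # 20-bit offset
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

# U+1F30E decomposes to (0xD83C, 0xDF0E) -- exactly the two "low"
# codepoints observed in the question.
```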

Some other good reading is PEP 393, which puts this to rest... for Python 3.3+, unfortunately.

Edit: googled for a workaround. Full credit to @dan04.

import struct

def code_points(text):
    # Encode to UTF-32LE, then unpack each 4-byte unit as one code point.
    utf32 = text.encode('UTF-32LE')
    return struct.unpack('<{}I'.format(len(utf32) // 4), utf32)

>>> len(utest)
4
>>> len(code_points(utest))
3

If you only care about the length you can do len(utest.encode('UTF-32LE')) // 4, but it seems like you want to do more, so perhaps the above function is helpful.
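If you'd rather iterate lazily and re-join the surrogate pairs yourself, a generator along these lines (my sketch, not from the original answer) also works, and is a no-op on wide builds:

```python
def iter_code_points(text):
    # Yield one integer code point per character, joining UTF-16
    # surrogate pairs back into single code points on narrow builds.
    it = iter(text)
    for ch in it:
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDBFF:   # high surrogate: consume its pair
            cp = 0x10000 + ((cp - 0xD800) << 10) + (ord(next(it)) - 0xDC00)
        yield cp
```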

Answered by roippi, Nov 03 '22