CPython stores unicode strings internally as either UTF-16 or UTF-32, depending on compile options. In UTF-16 (narrow) builds of Python, string slicing, iteration, and len seem to work on code units, not code points, so characters outside the Basic Multilingual Plane behave strangely.
E.g., on CPython 2.6 with sys.maxunicode = 65535:
>>> char = u'\U0001D49E'
>>> len(char)
2
>>> char[0:1]
u'\ud835'
>>> char[1:2]
u'\udc9e'
According to the Python documentation, sys.maxunicode is "An integer giving the largest supported code point for a Unicode character."
Does this mean that unicode operations aren't guaranteed to work on code points beyond sys.maxunicode? If I want to work with characters outside the BMP, do I either have to use a UTF-32 build or write my own portable unicode operations?
I came across this problem in How to iterate over Unicode characters in Python 3?
On a narrow build (sys.maxunicode == 65535), characters beyond the BMP are stored internally as UTF-16 surrogate pairs. Yes, you have to deal with this yourself or use a wide build; a sketch of one way to pair the surrogates yourself appears at the end of this answer. Even with a wide build, you may have to deal with single characters that are represented by a combination of code points. For example:
>>> print('a\u0301')
á
>>> print('\xe1')
á
The first uses a combining accent character and the second doesn't, but both print the same. You can use unicodedata.normalize to convert between the forms.
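For instance, here is a minimal sketch (assuming a Python 3 interpreter; the variable names are just for illustration) showing that NFC normalization composes the combining sequence into the single precomposed code point:

>>> import unicodedata
>>> decomposed = 'a\u0301'        # 'a' followed by U+0301 COMBINING ACUTE ACCENT
>>> composed = unicodedata.normalize('NFC', decomposed)
>>> len(decomposed), len(composed)
(2, 1)
>>> composed == '\xe1'            # same as the precomposed LATIN SMALL LETTER A WITH ACUTE
True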
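If you do stay on a narrow build, a rough sketch of pairing the surrogates yourself might look like the following (iter_code_points is a hypothetical helper written for this answer, not part of the standard library):

def iter_code_points(s):
    # Yield integer code points, joining UTF-16 surrogate pairs on narrow
    # builds; on wide builds (or Python 3) each item is already one code point.
    i = 0
    while i < len(s):
        ch = s[i]
        if u'\ud800' <= ch <= u'\udbff' and i + 1 < len(s) and u'\udc00' <= s[i + 1] <= u'\udfff':
            # High surrogate followed by low surrogate: combine them.
            high, low = ord(ch), ord(s[i + 1])
            yield 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
            i += 2
        else:
            yield ord(ch)
            i += 1

>>> [hex(cp) for cp in iter_code_points(u'a\U0001D49E')]
['0x61', '0x1d49e']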