How to get a reliable unicode character count in Python?

Question

Google App Engine uses Python 2.5.2, apparently with UCS4 enabled. But the GAE datastore uses UTF-8 internally. So if you store u'\ud834\udd0c' (length 2) in the datastore, when you retrieve it, you get u'\U0001d10c' (length 1). I'm trying to count the number of unicode characters in the string in a way that gives the same result before and after storing it. So I'm trying to normalize the string (from u'\ud834\udd0c' to u'\U0001d10c') as soon as I receive it, before calculating its length and putting it in the datastore. I know I can just encode it to UTF-8 and then decode again, but is there a more straightforward/efficient way?

bobince · Accepted Answer

I know I can just encode it to UTF-8 and then decode again

Yes, that's the usual idiom for fixing up input that has “UTF-16 surrogates in a UCS-4 string”. But as Mechanical snail said, this input is malformed, and you should preferably fix whatever produced it.
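For reference, here is a minimal sketch of the round-trip idiom. It is written for Python 3, which is an assumption on my part: the question is about Python 2.5, where the round trip is simply s.encode('utf-8').decode('utf-8'). Python 3's UTF-8 encoder refuses surrogates, so this sketch goes through UTF-16 with the 'surrogatepass' error handler instead. The function name merge_surrogates is made up for illustration.

```python
def merge_surrogates(s):
    """Round-trip through UTF-16 so that surrogate pairs become single code points.

    Python 3 analogue (an assumption, not from the original answer) of the
    Python 2 idiom s.encode('utf-8').decode('utf-8').
    """
    # 'surrogatepass' lets the encoder write surrogate code points as-is;
    # the decoder then reads each high/low pair back as one code point.
    return s.encode('utf-16', 'surrogatepass').decode('utf-16', 'surrogatepass')


fixed = merge_surrogates('\ud834\udd0c')
print(len(fixed))  # 1
```

Taking len() of the result then gives the same count before and after the datastore round trip, at least for input whose surrogates all come in valid pairs.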

is there a more straightforward/efficient way?

Well... you could do it manually with a regex, like:

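import re

# Replace each high/low surrogate pair with the single code point it encodes.
re.sub(
    u'([\uD800-\uDBFF])([\uDC00-\uDFFF])',
    lambda m: unichr(((ord(m.group(1)) - 0xD800) << 10) + (ord(m.group(2)) - 0xDC00) + 0x10000),
    s
)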

Certainly not more straightforward... I also have my doubts as to whether it's actually more efficient!
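If all you need is the count, you don't have to build the fixed string at all. The sketch below is a Python 3 rendering of the same regex (an assumption; the snippet above is Python 2), with helper names (combine, merge_surrogate_pairs, count_code_points) that are purely illustrative. It also spells out the arithmetic for the example from the question.

```python
import re

# High surrogate followed by low surrogate (regex-level \u escapes).
_PAIR = re.compile(r'([\ud800-\udbff])([\udc00-\udfff])')


def combine(hi, lo):
    """Map a high/low surrogate pair to the code point it encodes."""
    return chr(((ord(hi) - 0xD800) << 10) + (ord(lo) - 0xDC00) + 0x10000)


def merge_surrogate_pairs(s):
    """Return s with every surrogate pair replaced by one code point."""
    return _PAIR.sub(lambda m: combine(m.group(1), m.group(2)), s)


def count_code_points(s):
    """Count code points, treating each surrogate pair as one character."""
    return len(s) - len(_PAIR.findall(s))


# Worked example: (0xD834 - 0xD800) << 10 = 0xD000, 0xDD0C - 0xDC00 = 0x10C,
# and 0xD000 + 0x10C + 0x10000 = 0x1D10C.
assert combine('\ud834', '\udd0c') == '\U0001d10c'
```

Both forms of the string give the same count: count_code_points('\ud834\udd0c') and count_code_points('\U0001d10c') each return 1. Unpaired surrogates are left alone and counted as one character each.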

Tags: python, unicode, utf-16, utf-32, google-app-engine

Travis
