I need to step through a Python string one character at a time, but a simple "for" loop gives me UTF-16 code units instead:
str = "abc\u20ac\U00010302\U0010fffd"
for ch in str:
    code = ord(ch)
    print("U+{:04X}".format(code))
That prints:
U+0061
U+0062
U+0063
U+20AC
U+D800
U+DF02
U+DBFF
U+DFFD
when what I wanted was:
U+0061
U+0062
U+0063
U+20AC
U+10302
U+10FFFD
Is there any way to get Python to give me the sequence of Unicode code points, regardless of how the string is actually encoded under the hood? I'm testing on Windows here, but I need code that will work anywhere. It only needs to work on Python 3; I don't care about Python 2.x.
The best I've been able to come up with so far is this:
import codecs
str = "abc\u20ac\U00010302\U0010fffd"
bytestr, _ = codecs.getencoder("utf_32_be")(str)
for i in range(0, len(bytestr), 4):
    code = 0
    for b in bytestr[i:i + 4]:
        code = (code << 8) + b
    print("U+{:04X}".format(code))
But I'm hoping there's a simpler way.
(Pedantic nitpicking over precise Unicode terminology will be ruthlessly beaten over the head with a clue-by-four. I think I've made it clear what I'm after here, please don't waste space with "but UTF-16 is technically Unicode too" kind of arguments.)
On Python 3.2.1 with narrow Unicode build:
PythonWin 3.2.1 (default, Jul 10 2011, 21:51:15) [MSC v.1500 32 bit (Intel)] on win32.
Portions Copyright 1994-2008 Mark Hammond - see 'Help/About PythonWin' for further copyright information.
>>> import sys
>>> sys.maxunicode
65535
What you've discovered (UTF-16 encoding):
>>> s = "abc\u20ac\U00010302\U0010fffd"
>>> len(s)
8
>>> for c in s:
...     print('U+{:04X}'.format(ord(c)))
...
U+0061
U+0062
U+0063
U+20AC
U+D800
U+DF02
U+DBFF
U+DFFD
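The two-unit values above are UTF-16 surrogate pairs: a high surrogate (U+D800 to U+DBFF) followed by a low surrogate (U+DC00 to U+DFFF). As a rough sketch of the arithmetic that turns a pair back into one code point (the helper name `combine_surrogates` is made up for illustration, it is not a standard library function):

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print("U+{:04X}".format(combine_surrogates(0xD800, 0xDF02)))  # U+10302
print("U+{:04X}".format(combine_surrogates(0xDBFF, 0xDFFD)))  # U+10FFFD
```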
A way around it:
>>> import struct
>>> s=s.encode('utf-32-be')
>>> struct.unpack('>{}L'.format(len(s)//4),s)
(97, 98, 99, 8364, 66306, 1114109)
>>> for i in struct.unpack('>{}L'.format(len(s)//4),s):
...     print('U+{:04X}'.format(i))
...
U+0061
U+0062
U+0063
U+20AC
U+10302
U+10FFFD
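If you need this in more than one place, the same UTF-32 idea can be wrapped in a small generator. This is only a sketch: `code_points` is a name picked for illustration, and it uses `int.from_bytes` (available since Python 3.2) in place of `struct`. Note that encoding raises `UnicodeEncodeError` on builds where the string contains unpaired surrogates.

```python
def code_points(s):
    """Yield the integer code point of each character in s."""
    data = s.encode("utf-32-be")  # 4 bytes per code point, big-endian, no BOM
    for i in range(0, len(data), 4):
        yield int.from_bytes(data[i:i + 4], "big")

for cp in code_points("abc\u20ac\U00010302\U0010fffd"):
    print("U+{:04X}".format(cp))
```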
On Python 3.3 and later, where narrow builds no longer exist (PEP 393), it works the way the OP expects:
>>> s = "abc\u20ac\U00010302\U0010fffd"
>>> len(s)
6
>>> for c in s:
...     print('U+{:04X}'.format(ord(c)))
...
U+0061
U+0062
U+0063
U+20AC
U+10302
U+10FFFD
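If the code has to run unchanged on both kinds of build, one possible approach is to walk the string by index and merge any surrogate pair you meet. This is a hedged sketch rather than a tested-everywhere recipe; `iter_code_points` is a hypothetical name. One caveat: on a wide build it will also merge two lone surrogates that happen to sit next to each other, which is probably what you want anyway.

```python
def iter_code_points(s):
    """Yield code points from s, merging UTF-16 surrogate pairs if present."""
    i, n = 0, len(s)
    while i < n:
        cp = ord(s[i])
        # A high surrogate followed by a low surrogate encodes one code point.
        if 0xD800 <= cp <= 0xDBFF and i + 1 < n:
            low = ord(s[i + 1])
            if 0xDC00 <= low <= 0xDFFF:
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00)
                i += 1  # skip the low surrogate we just consumed
        yield cp
        i += 1

for cp in iter_code_points("abc\u20ac\U00010302\U0010fffd"):
    print("U+{:04X}".format(cp))
```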