
How to iterate over Unicode characters in Python 3?

I need to step through a Python string one character at a time, but a simple "for" loop gives me UTF-16 code units instead:

s = "abc\u20ac\U00010302\U0010fffd"
for ch in s:
    code = ord(ch)
    print("U+{:04X}".format(code))

That prints:

U+0061
U+0062
U+0063
U+20AC
U+D800
U+DF02
U+DBFF
U+DFFD

when what I wanted was:

U+0061
U+0062
U+0063
U+20AC
U+10302
U+10FFFD

Is there any way to get Python to give me the sequence of Unicode code points, regardless of how the string is actually encoded under the hood? I'm testing on Windows here, but I need code that will work anywhere. It only needs to work on Python 3; I don't care about Python 2.x.

The best I've been able to come up with so far is this:

import codecs
s = "abc\u20ac\U00010302\U0010fffd"
bytestr, _ = codecs.getencoder("utf_32_be")(s)
for i in range(0, len(bytestr), 4):
    # Assemble one big-endian 32-bit code point from four bytes.
    code = 0
    for b in bytestr[i:i + 4]:
        code = (code << 8) + b
    print("U+{:04X}".format(code))

But I'm hoping there's a simpler way.

(Pedantic nitpicking over precise Unicode terminology will be ruthlessly beaten over the head with a clue-by-four. I think I've made it clear what I'm after here; please don't waste space with "but UTF-16 is technically Unicode too" kind of arguments.)

Ross Smith asked Sep 21 '11 02:09



1 Answer

On Python 3.2.1 with a narrow Unicode build (sys.maxunicode is 65535, so strings are stored as UTF-16 code units):

PythonWin 3.2.1 (default, Jul 10 2011, 21:51:15) [MSC v.1500 32 bit (Intel)] on win32.
Portions Copyright 1994-2008 Mark Hammond - see 'Help/About PythonWin' for further copyright information.
>>> import sys
>>> sys.maxunicode
65535

What you've discovered (the string is stored as UTF-16, so each character above U+FFFF shows up as a surrogate pair):

>>> s = "abc\u20ac\U00010302\U0010fffd"
>>> len(s)
8
>>> for c in s:
...     print('U+{:04X}'.format(ord(c)))
...     
U+0061
U+0062
U+0063
U+20AC
U+D800
U+DF02
U+DBFF
U+DFFD
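
For reference, the pairs in that output follow the standard UTF-16 arithmetic: a high surrogate (U+D800 to U+DBFF) followed by a low surrogate (U+DC00 to U+DFFF) encodes one code point at or above U+10000. Here is a minimal sketch of merging them by hand, in case it helps to see it spelled out. The name iter_code_points is made up for this example and is not part of the standard library.

```python
def iter_code_points(text):
    """Yield code points, merging UTF-16 surrogate pairs if any are present.

    Hypothetical helper for illustration only.
    """
    i, n = 0, len(text)
    while i < n:
        cp = ord(text[i])
        i += 1
        # A high surrogate followed by a low surrogate forms one code point.
        if 0xD800 <= cp <= 0xDBFF and i < n:
            low = ord(text[i])
            if 0xDC00 <= low <= 0xDFFF:
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00)
                i += 1
        yield cp


s = "abc\u20ac\U00010302\U0010fffd"
for cp in iter_code_points(s):
    print("U+{:04X}".format(cp))
```

On an interpreter where the string already holds whole code points, the surrogate branch never fires and the characters pass through unchanged, so the same loop should behave the same way on either kind of build. An unpaired surrogate is yielded as-is rather than rejected, which is a design choice of this sketch.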

A way around it (encode to UTF-32, then unpack each 4-byte unit as an unsigned integer):

>>> import struct
>>> s=s.encode('utf-32-be')
>>> struct.unpack('>{}L'.format(len(s)//4),s)
(97, 98, 99, 8364, 66306, 1114109)
>>> for i in struct.unpack('>{}L'.format(len(s)//4),s):
...     print('U+{:04X}'.format(i))
...     
U+0061
U+0062
U+0063
U+20AC
U+10302
U+10FFFD
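
The same workaround can be wrapped in a small function so the original string is not overwritten by its encoded bytes. This is only a sketch: code_points is a hypothetical helper name, and it assumes the string contains no unpaired surrogates, which the UTF-32 encoder may refuse to encode.

```python
import struct


def code_points(text):
    """Return the code points of text as a list of ints.

    Hypothetical helper: encodes to big-endian UTF-32, where every
    code point occupies exactly four bytes, then unpacks them.
    """
    data = text.encode("utf-32-be")
    count = len(data) // 4
    return list(struct.unpack(">{}L".format(count), data))


s = "abc\u20ac\U00010302\U0010fffd"
for cp in code_points(s):
    print("U+{:04X}".format(cp))
```

Using the "-be" variant of the codec avoids the byte order mark that plain "utf-32" would put at the start of the data.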

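Update for Python 3.3:

Python 3.3 dropped the narrow/wide build distinction (PEP 393), so a str is always indexed by code point and iteration now works the way the OP expects: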

>>> s = "abc\u20ac\U00010302\U0010fffd"
>>> len(s)
6
>>> for c in s:
...     print('U+{:04X}'.format(ord(c)))
...     
U+0061
U+0062
U+0063
U+20AC
U+10302
U+10FFFD
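
A quick way to confirm which behaviour an interpreter has is to check sys.maxunicode, as the 3.2.1 session above did. Here is a sketch, assuming Python 3.3 or later; the variable names are just for illustration.

```python
import sys

s = "abc\u20ac\U00010302\U0010fffd"

# On 3.3+ this is always 0x10FFFF; a narrow 3.2 build reports 0xFFFF.
full_range = sys.maxunicode == 0x10FFFF

# With full-range strings, plain iteration yields one code point per step.
labels = ["U+{:04X}".format(ord(c)) for c in s]

print(full_range, len(s), labels)
```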
Mark Tolonen answered Oct 08 '22 17:10
