Iterate over UTF-8 characters in Python

Question

I am using Python 3.6 to read a UTF-8 encoded file in Spanish (so it includes the letter ñ). I open the file with the utf-8 codec, and it loads correctly: while debugging, I can see ñ in the loaded text.

However, when I iterate over characters, ñ is read as two characters, n and ~. Concretely, when I run:

for c in text:
    hexc = int(hex(ord(c)), 16)
    if U_LETTERS[lang][0] <= hexc <= U_LETTERS[lang][1] \
            or hexc in U_LETTERS[lang][2:] \
            or hexc == U_SPACE:
        filtered_text += c

and text includes an ñ, the variable c first takes the value n (so hexc is 110 instead of 241) and then ~ (so hexc is 771). I guess there is an internal conversion to an 8-bit char when iterating this way. What is the proper way to do this?

Thanks in advance.

lenz · Accepted Answer

This has to do with Unicode normalisation. The letter "ñ" can be expressed either as a single character with the code point 0xF1 (241), or as two characters: "n" followed by a combining character for the superposed tilde, i.e. the code points 0x6E and 0x0303 (110 and 771).

These two ways of expressing the letter are considered equivalent; however, they do not compare equal as strings. Python's unicodedata module provides functionality to convert from one form to the other. The first form is called the composed normalised form (NFC), the second the decomposed normalised form (NFD).

An example shows this most easily:

>>> import unicodedata
>>> '\xf1'
'ñ'
>>> [ord(c) for c in '\xf1']
[241]
>>> [ord(c) for c in unicodedata.normalize('NFD', '\xf1')]
[110, 771]
>>> [ord(c) for c in unicodedata.normalize('NFC', 'n\u0303')]
[241]
>>>
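
The point about string comparison can be checked directly. The following is a minimal sketch; the variable names are only illustrative.

```python
import unicodedata

composed = "\xf1"        # single code point U+00F1
decomposed = "n\u0303"   # "n" followed by U+0303 COMBINING TILDE

# The two strings render the same but do not compare equal.
print(composed == decomposed)           # False
print(len(composed), len(decomposed))   # 1 2

# Once both are normalised to the same form, they compare equal.
same_after_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)
print(same_after_nfc)                   # True
```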

So, to solve your problem, convert all of the text to the desired normalisation form before any further processing.
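
Applied to the loop from the question, this could look roughly like the sketch below. The contents of U_LETTERS and U_SPACE are not shown in the question, so the values here are hypothetical stand-ins that only follow the same layout (range start, range end, then extra code points); filter_text is likewise an illustrative helper name.

```python
import unicodedata

# Hypothetical stand-ins: the question does not show the real tables.
U_SPACE = 0x20
U_LETTERS = {"es": (0x61, 0x7A, 0xE1, 0xE9, 0xED, 0xF1, 0xF3, 0xFA, 0xFC)}


def filter_text(text, lang):
    """Keep only letters of `lang` and spaces, after normalising to NFC."""
    low, high, *extra = U_LETTERS[lang]
    # Normalise first, so "n" + combining tilde becomes the single ñ.
    text = unicodedata.normalize("NFC", text)
    kept = []
    for c in text:
        cp = ord(c)  # ord() already returns the code point as an int
        if low <= cp <= high or cp in extra or cp == U_SPACE:
            kept.append(c)
    return "".join(kept)


decomposed_input = "an\u0303o 2024"  # "año 2024" with a decomposed ñ
result = filter_text(decomposed_input, "es")
print([ord(c) for c in result])      # [97, 241, 111, 32]
```

Without the normalize call, the combining tilde (771) would fail the check and be dropped, leaving a plain "n" in the output.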

Note: Unicode normalisation is a separate issue from encoding. You can run into it with UTF-16 or UTF-32 just as well. In the decomposed form, you actually have two (or more) characters (each of which might be represented with multiple bytes, depending on the encoding). It's up to the displaying device (the terminal emulator, an editor...) to show this as a single letter with marks above/below the base character.
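
To illustrate that the encoding is not the cause, here is a small sketch that encodes both forms in several encodings and records the byte counts. The little-endian codec names are chosen only to avoid a byte-order mark in the counts.

```python
import unicodedata

composed = unicodedata.normalize("NFC", "n\u0303")   # one code point
decomposed = unicodedata.normalize("NFD", "\xf1")    # two code points

byte_lengths = {}
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    nfc_bytes = composed.encode(encoding)
    nfd_bytes = decomposed.encode(encoding)
    byte_lengths[encoding] = (len(nfc_bytes), len(nfd_bytes))
    # Decoding gives back the same number of code points in every encoding.
    assert len(nfc_bytes.decode(encoding)) == 1
    assert len(nfd_bytes.decode(encoding)) == 2

print(byte_lengths)
# {'utf-8': (2, 3), 'utf-16-le': (2, 4), 'utf-32-le': (4, 8)}
```

The byte counts differ per encoding, but the number of characters seen when iterating depends only on the normalisation form.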

Tags: python, character-encoding, utf-8, python-3.6

rgalhama
