
Iterate over UTF-8 characters in Python

I am using Python 3.6 to read a UTF-8 encoded file in Spanish (so it includes the letter ñ). I open the file with the utf-8 codec, and it loads correctly: while debugging, I can see ñ in the loaded text.

However, when I iterate over characters, ñ is read as two characters, n and ~. Concretely, when I run:

for c in text:
    hexc = int(hex(ord(c)), 16)  # same value as ord(c)
    if U_LETTERS[lang][0] <= hexc <= U_LETTERS[lang][1] \
            or hexc in U_LETTERS[lang][2:] \
            or hexc == U_SPACE:
        filtered_text += c

and text includes an ñ, the variable c first takes the value n (so hexc is 110 instead of 241) and then ~ (so hexc is 771). I guess there is an internal conversion to an 8-bit char when iterating in this way. What is the proper way to do this?

Thanks in advance.

rgalhama asked Feb 05 '26 16:02


1 Answer

This has to do with Unicode normalisation. The letter "ñ" can be expressed either as a single character with the codepoint 0xF1 (241), or as two characters: "n" followed by a combining character for the superposed tilde, i.e. the codepoints 0x6E and 0x0303 (110 and 771).

These two ways of expressing the letter are considered equivalent; however, they do not compare equal as strings. Python can convert from one form to the other by means of the unicodedata module. The first form is called the composed normalised form (NFC), the second one the decomposed normalised form (NFD).

An example shows this most easily:

>>> import unicodedata
>>> '\xf1'
'ñ'
>>> [ord(c) for c in '\xf1']
[241]
>>> [ord(c) for c in unicodedata.normalize('NFD', '\xf1')]
[110, 771]
>>> [ord(c) for c in unicodedata.normalize('NFC', 'n\u0303')]
[241]
>>> 

So, to solve your problem, convert all of the text to the desired normalisation form before any further processing.
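As a minimal sketch of that advice applied to the loop in the question: normalise to NFC first, then filter by codepoint. The names `LETTER_RANGE`, `EXTRA_LETTERS`, `SPACE` and `filter_letters` are hypothetical stand-ins for the asker's `U_LETTERS` and `U_SPACE` tables, whose real contents are not shown in the question.

```python
import unicodedata

# Hypothetical stand-ins for the question's U_LETTERS / U_SPACE tables.
LETTER_RANGE = (ord("a"), ord("z"))
EXTRA_LETTERS = {0xE1, 0xE9, 0xED, 0xF1, 0xF3, 0xFA, 0xFC}  # á é í ñ ó ú ü
SPACE = 0x20


def filter_letters(text):
    """Keep only allowed letters and spaces, after composing to NFC."""
    composed = unicodedata.normalize("NFC", text)
    kept = []
    for c in composed:
        cp = ord(c)
        if (LETTER_RANGE[0] <= cp <= LETTER_RANGE[1]
                or cp in EXTRA_LETTERS
                or cp == SPACE):
            kept.append(c)
    return "".join(kept)


# "niño" written with n + combining tilde (decomposed), plus digits to drop.
decomposed = "nin\u0303o 42"
result = filter_letters(decomposed)
```

Without the `normalize` call, the combining tilde (771) would fall outside the allowed codepoints and be dropped, leaving "nino" instead of "niño".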

Note: Unicode normalisation is a problem separate from encoding. You can run into it with UTF-16 or UTF-32 just as well. In the decomposed form, you actually have two (or more) characters (each of which might be represented with multiple bytes, depending on the encoding). It's up to the displaying device (the terminal emulator, an editor...) to show this as a single letter with marks above/below the base character.
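To illustrate that the two forms survive any of these encodings unchanged, here is a small sketch that round-trips both through UTF-8, UTF-16 and UTF-32 and records the byte lengths. The variable names are only illustrative.

```python
import unicodedata

composed = "\xf1"                                    # one codepoint: U+00F1
decomposed = unicodedata.normalize("NFD", composed)  # two: U+006E U+0303

byte_lengths = {}
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    # Encoding and decoding never composes or decomposes anything:
    # the codepoints come back exactly as they went in.
    assert composed.encode(encoding).decode(encoding) == composed
    assert decomposed.encode(encoding).decode(encoding) == decomposed
    byte_lengths[encoding] = (
        len(composed.encode(encoding)),
        len(decomposed.encode(encoding)),
    )

# The strings differ in every encoding until one is normalised.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
```

The byte counts differ per encoding, but the number of codepoints (one versus two) is the same in all of them, which is why changing the codec used to open the file does not help.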

lenz answered Feb 07 '26 05:02


