I have a big dataframe with people data. I would like to flatten all the unusual diacritics and convert each character to its closest ASCII equivalent. Based on a solution I found on SO, I do the following:
for column in df.columns:
    df[column] = (df[column].astype("str")
                            .str.normalize('NFKD')
                            .str.encode('ascii', errors='ignore')
                            .str.decode('utf-8'))
It works for most cases (I haven't checked them all), but I have noticed it misses the letter 'ł' in Polish. For example, Lech Wałęsa is translated to Lech Waesa, while my expectation would be to see Lech Walesa. My guess is that this is what the ignore parameter does in the str.encode method. Any idea how to make it work for any diacritic?
Try using unidecode; it worked perfectly for the example you described.
from unidecode import unidecode
for column in df.columns:
    # str(x) guards against non-string values in the column
    df[column] = [unidecode(str(x)) for x in df[column].values]
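A minimal sketch of what unidecode does on its own, assuming the third-party Unidecode package is installed (`pip install Unidecode`):

```python
from unidecode import unidecode

# unidecode transliterates a Unicode string to its closest ASCII
# representation, covering characters (like 'ł') that have no
# NFKD decomposition.
print(unidecode("Lech Wałęsa"))  # Lech Walesa
```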
Look at what normalize('NFKD') actually does to your input string "Lech Wałęsa":
import unicodedata
s = "Lech Wałęsa"
print(list(unicodedata.normalize("NFKD", s)))
['L', 'e', 'c', 'h', ' ', 'W', 'a', 'ł', 'e', '̨', 's', 'a']
As you can see, the character 'ę' is decomposed into the letter 'e' and the combining diacritic '̨', but no such decomposition takes place for 'ł'. In Unicode, the stroke in LATIN SMALL LETTER L WITH STROKE 'ł' and LATIN CAPITAL LETTER L WITH STROKE 'Ł' is treated as part of the base letter rather than as a combining mark, so these characters have no canonical decomposition.
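You can confirm this with `unicodedata.decomposition`, which returns an empty string for characters that have no decomposition mapping:

```python
import unicodedata

# 'ę' decomposes into 'e' (U+0065) followed by COMBINING OGONEK (U+0328)...
print(unicodedata.decomposition("ę"))  # 0065 0328
# ...but 'ł' has no decomposition entry at all, so NFKD leaves it untouched.
print(unicodedata.decomposition("ł"))  # '' (empty string)
```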
Now, when you use the output of normalize("NFKD") as input for encode("ascii", errors="ignore"), all characters that can't be expressed as ASCII are silently dropped, which gives you the output "Lech Waesa".
What you could do to solve this problem is to replace the exceptional 'Ł' and 'ł' manually with 'L' and 'l' before you remove the diacritics from the characters that do support decomposition. You'd have to change your code like so:
for column in df.columns:
    df[column] = (df[column].astype("str")
                            .str.replace("ł", "l")
                            .str.replace("Ł", "L")
                            .str.normalize('NFKD')
                            .str.encode('ascii', errors='ignore')
                            .str.decode('utf-8'))
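As a sanity check, the same pipeline works on a plain string without pandas; a minimal sketch of the steps above:

```python
import unicodedata

def to_ascii(s: str) -> str:
    # Handle the stroke letters first, since they have no NFKD decomposition.
    s = s.replace("ł", "l").replace("Ł", "L")
    # Decompose remaining accented letters into base letter + combining mark,
    # then drop everything non-ASCII (i.e. the combining marks).
    s = unicodedata.normalize("NFKD", s)
    return s.encode("ascii", errors="ignore").decode("utf-8")

print(to_ascii("Lech Wałęsa"))  # Lech Walesa
```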