
Normalization misses Polish characters

I have a big dataframe with people data. I would like to strip all diacritics and convert each character to its closest ASCII equivalent. Based on a solution I found on SO, I do the following:

for column in df.columns:
    df[column] = df[column].astype("str").str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')

It works for most cases (I haven't checked them all), but I have noticed that it misses the Polish letter 'ł'. For example, Lech Wałęsa is converted to Lech Waesa, while I would expect to see Lech Walesa. My guess is that this is caused by the errors='ignore' parameter of str.encode. Any idea how to make it work for any diacritic?

pawelty asked Dec 06 '22 15:12


2 Answers

Try using unidecode, a third-party package (pip install unidecode). It worked perfectly for the example you described.

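from unidecode import unidecode

# Transliterate every value to its closest ASCII representation.
# unidecode expects str input, so this assumes the columns hold strings.
for column in df.columns:
    df[column] = [unidecode(x) for x in df[column].values]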

Grzegorz Melniczak answered Dec 21 '22 05:12


Look at what normalize('NFKD') actually does to your input string "Lech Wałęsa":

import unicodedata
s = "Lech Wałęsa"
print(list(unicodedata.normalize("NFKD", s)))

['L', 'e', 'c', 'h', ' ', 'W', 'a', 'ł', 'e', '̨', 's', 'a']

As you can see, the character 'ę' is decomposed into the letter 'e' and the combining ogonek '̨', but no such decomposition takes place for 'ł'. The Unicode characters LATIN SMALL LETTER L WITH STROKE 'ł' and LATIN CAPITAL LETTER L WITH STROKE 'Ł' have no decomposition mapping: Unicode treats the stroke as part of the letter rather than as a separate combining mark, so normalization leaves them unchanged.
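
If you want to check this for other characters, the standard library can tell you whether a decomposition mapping exists. The following is only a small sketch; the helper name has_decomposition is made up for illustration and is not part of any library:

```python
import unicodedata

def has_decomposition(ch):
    # unicodedata.decomposition() returns an empty string when the
    # character has no decomposition mapping in the Unicode database.
    # (has_decomposition is a hypothetical helper name.)
    return unicodedata.decomposition(ch) != ""

for ch in "ęłŁ":
    print(ch, unicodedata.name(ch), has_decomposition(ch))
```

Characters for which this prints False will survive normalize("NFKD") unchanged and therefore need separate handling.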

Now, when you use the output of normalize("NFKD") as input for encode("ascii", errors="ignore"), what happens is that all characters that can't be expressed as ASCII characters are silently ignored, which gives you the output "Lech Waesa".
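
Since you mention that you haven't checked all cases, it may help to list the letters that would be dropped this way. Here is a rough sketch using only the standard library; dropped_characters is a name I made up, and the check simply looks for non-ASCII characters that are not combining marks after normalization:

```python
import unicodedata

def dropped_characters(text):
    # Hypothetical helper: after NFKD, combining marks are expected to
    # disappear, so only report non-ASCII characters that are NOT
    # combining marks. These are the letters that get lost silently.
    decomposed = unicodedata.normalize("NFKD", text)
    return [ch for ch in decomposed
            if ord(ch) > 127 and not unicodedata.combining(ch)]

print(dropped_characters("Lech Wałęsa"))  # ['ł']
```

Running something like this over the distinct values of a column should show which characters need a manual replacement.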

What you could do to solve this problem is to replace the exceptional 'Ł' and 'ł' manually by 'L' and 'l' before you remove the diacritics from the characters that support decomposition. You'd have to change your code like so:

for column in df.columns:
    df[column] = (df[column].astype("str")
                            .str.replace("ł", "l")
                            .str.replace("Ł", "L")
                            .str.normalize('NFKD')
                            .str.encode('ascii', errors='ignore')
                            .str.decode('utf-8'))
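
If more such letters turn up in your data, chaining many str.replace calls gets unwieldy. One possible alternative is a translation table applied before normalization. This is just a sketch: the table below is my own, deliberately non-exhaustive selection of letters without a decomposition, and the names NO_DECOMPOSITION and to_ascii are made up for this example:

```python
import unicodedata

# Hypothetical, non-exhaustive table of letters that have no Unicode
# decomposition and therefore need an explicit ASCII replacement.
NO_DECOMPOSITION = str.maketrans({
    "ł": "l", "Ł": "L",
    "ø": "o", "Ø": "O",
    "đ": "d", "Đ": "D",
    "ß": "ss",
})

def to_ascii(text):
    # 1. Replace the letters that normalization cannot split up.
    replaced = text.translate(NO_DECOMPOSITION)
    # 2. Decompose the rest into base letter + combining marks.
    decomposed = unicodedata.normalize("NFKD", replaced)
    # 3. Drop everything that is still not ASCII (the combining marks).
    return decomposed.encode("ascii", errors="ignore").decode("ascii")

print(to_ascii("Lech Wałęsa"))  # Lech Walesa
```

In the dataframe loop this could be applied with something like df[column].astype("str").map(to_ascii), assuming the same df as in the question.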

Schmuddi answered Dec 21 '22 07:12