I have a big dataframe with people data. I would like to flatten all the unusual diacritics and convert each character to its closest ASCII equivalent. Based on a solution I found on SO, I do the following:
for column in df.columns:
    df[column] = (df[column].astype("str")
                            .str.normalize('NFKD')
                            .str.encode('ascii', errors='ignore')
                            .str.decode('utf-8'))
It works for most cases (I haven't checked them all), but I have noticed it misses the letter 'ł' in Polish. For example, Lech Wałęsa is translated to Lech Waesa, while my expectation would be to see Lech Walesa. My guess is that this is what the ignore parameter does in the str.encode method. Any idea how to make it work for any diacritic?
Try using unidecode; it worked perfectly for the example you described.
from unidecode import unidecode
for column in df.columns:
    # str(x) guards against non-string values in the column
    df[column] = [unidecode(str(x)) for x in df[column].values]
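A minimal sketch of what unidecode does on its own, assuming the third-party Unidecode package is installed (`pip install Unidecode`):

```python
from unidecode import unidecode

# unidecode transliterates a Unicode string to its closest ASCII
# representation, covering characters (like 'ł') that have no
# NFKD decomposition.
print(unidecode("Lech Wałęsa"))  # Lech Walesa
```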
Look at what normalize('NFKD') actually does to your input string "Lech Wałęsa":
import unicodedata
s = "Lech Wałęsa"
print(list(unicodedata.normalize("NFKD", s)))
['L', 'e', 'c', 'h', ' ', 'W', 'a', 'ł', 'e', '̨', 's', 'a']
As you can see, the character 'ę' is decomposed into the letter 'e' and the combining diacritic '̨', but no such decomposition takes place for 'ł'. In Unicode, the stroke in LATIN SMALL LETTER L WITH STROKE 'ł' and LATIN CAPITAL LETTER L WITH STROKE 'Ł' is treated as part of the base letter rather than as a combining mark, so these characters have no canonical decomposition.
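You can confirm this with `unicodedata.decomposition`, which returns an empty string for characters that have no decomposition mapping:

```python
import unicodedata

# 'ę' decomposes into 'e' (U+0065) followed by COMBINING OGONEK (U+0328)...
print(unicodedata.decomposition("ę"))  # 0065 0328
# ...but 'ł' has no decomposition entry at all, so NFKD leaves it untouched.
print(unicodedata.decomposition("ł"))  # '' (empty string)
```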
Now, when you use the output of normalize("NFKD") as input for encode("ascii", errors="ignore"), all characters that can't be expressed as ASCII are silently dropped, which gives you the output "Lech Waesa".
What you could do to solve this problem is to replace the exceptional 'Ł' and 'ł' manually with 'L' and 'l' before you remove the diacritics from the characters that do support decomposition. You'd have to change your code like so:
for column in df.columns:
    df[column] = (df[column].astype("str")
                            .str.replace("ł", "l")
                            .str.replace("Ł", "L")
                            .str.normalize('NFKD')
                            .str.encode('ascii', errors='ignore')
                            .str.decode('utf-8'))
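As a sanity check, the same pipeline works on a plain string without pandas; a minimal sketch of the steps above:

```python
import unicodedata

def to_ascii(s: str) -> str:
    # Handle the stroke letters first, since they have no NFKD decomposition.
    s = s.replace("ł", "l").replace("Ł", "L")
    # Decompose remaining accented letters into base letter + combining mark,
    # then drop everything non-ASCII (i.e. the combining marks).
    s = unicodedata.normalize("NFKD", s)
    return s.encode("ascii", errors="ignore").decode("utf-8")

print(to_ascii("Lech Wałęsa"))  # Lech Walesa
```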