How to replace accents in a column of a pandas dataframe

I have a dataframe dataSwiss which contains information on Swiss municipalities. I want to replace the accented letters with their unaccented equivalents.

This is what I am doing:

dataSwiss['Municipality'] = dataSwiss['Municipality'].str.encode('utf-8')
dataSwiss['Municipality'] = dataSwiss['Municipality'].str.replace(u"é", "e")

but I get the following error:

----> 2 dataSwiss['Municipality'] = dataSwiss['Municipality'].str.replace(u"é", "e")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

The data looks like this:

dataSwiss.Municipality
0               Zürich
1               Zürich
2               Zürich
3               Zürich
4               Zürich
5               Zürich
6               Zürich
7               Zürich

I found the solution:

s = dataSwiss['Municipality']
res = s.str.decode('utf-8')
res = res.str.replace(u"é", "e")
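
On Python 3, string columns typically already hold text rather than bytes, so the encode/decode round trip should not be needed, and several accented letters can be replaced in one pass with a translation table. A minimal sketch, assuming Python 3; ACCENT_MAP and replace_accents are hypothetical names, and the mapping only covers a few characters for illustration:

```python
# Hypothetical mapping: extend it with whatever characters the data contains.
ACCENT_MAP = str.maketrans({'é': 'e', 'è': 'e', 'ü': 'u', 'ö': 'o', 'ä': 'a'})

def replace_accents(text):
    """Replace the accented letters listed in ACCENT_MAP, leaving the rest untouched."""
    return text.translate(ACCENT_MAP)

print(replace_accents('Zürich'))
```

With pandas, the same table can be passed to the column directly, for example dataSwiss['Municipality'].str.translate(ACCENT_MAP).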

emax asked May 09 '18 12:05


2 Answers

This is one way: normalize the strings to NFKD, encode them to ASCII bytes while ignoring characters that cannot be represented, then decode the bytes back to text.

import pandas as pd

s = pd.Series(['hello', 'héllo', 'Zürich', 'Zurich'])

res = s.str.normalize('NFKD')\
       .str.encode('ascii', errors='ignore')\
       .str.decode('utf-8')

print(res)

0     hello
1     hello
2    Zurich
3    Zurich
dtype: object

pd.Series.str.normalize uses the unicodedata module. As per the docs:

The normal form KD (NFKD) will apply the compatibility decomposition, i.e. replace all compatibility characters with their equivalents.
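
To illustrate what the decomposition does: NFKD splits a letter such as 'é' into the base letter 'e' followed by a separate combining accent, and it is that combining mark which is then discarded. The sketch below uses only the standard unicodedata module; strip_accents is a hypothetical helper name, and it filters combining marks instead of encoding to ASCII, which is a variation on the approach above rather than the same code:

```python
import unicodedata

def strip_accents(text):
    """Return text with combining marks removed after NFKD decomposition."""
    decomposed = unicodedata.normalize('NFKD', text)
    # unicodedata.combining() returns a non-zero class for combining marks
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

# 'é' decomposes into 'e' (U+0065) plus the combining acute accent (U+0301)
decomposed = unicodedata.normalize('NFKD', 'é')
print([hex(ord(ch)) for ch in decomposed])  # ['0x65', '0x301']

print(strip_accents('Zürich'))  # Zurich
```

One difference worth noting: encoding to ASCII with errors='ignore' drops every non-ASCII character, including ones with no decomposition such as 'ß' or 'ø', whereas filtering combining marks leaves those characters in place.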

jpp answered Sep 24 '22 18:09


Try the unidecode module.

Example:

import unidecode
dataSwiss['Municipality'] = dataSwiss['Municipality'].apply(unidecode.unidecode)

Or:

import unicodedata
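def remove_accents(input_str):
    # Decompose accented characters into base letter + combining mark
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    # Drop everything that is not ASCII, including the combining marks
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    # Decode so the result is text rather than bytes on Python 3
    return only_ascii.decode('ASCII')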

dataSwiss['Municipality'] = dataSwiss['Municipality'].apply(remove_accents)

Note: The function is from this link

Update as per comment (Python 2 only, since unicode is not a built-in on Python 3):

dataSwiss['Municipality'] = dataSwiss['Municipality'].apply(unicode).apply(remove_accents)
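
On Python 3 the unicode built-in is gone, and unicodedata.normalize raises a TypeError when it is handed a non-string such as a missing value. A minimal sketch of a variant that passes non-string values through unchanged; remove_accents_safe is a hypothetical name, and the assumption is that the column may contain NaN or None alongside text:

```python
import unicodedata

def remove_accents_safe(value):
    """Strip accents from strings; return any other value unchanged."""
    if not isinstance(value, str):
        # Missing values (None, NaN) and numbers are left as they are
        return value
    nfkd_form = unicodedata.normalize('NFKD', value)
    # Encode to ASCII, dropping combining marks, then decode back to text
    return nfkd_form.encode('ascii', 'ignore').decode('ascii')

values = ['Genève', None, 42]
print([remove_accents_safe(v) for v in values])
```

It can be applied the same way as above, with dataSwiss['Municipality'].apply(remove_accents_safe).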

Rakesh answered Sep 22 '22 18:09