I have a dataframe dataSwiss which contains information on Swiss municipalities. I want to replace the accented letters with plain letters.
This is what I am doing:
dataSwiss['Municipality'] = dataSwiss['Municipality'].str.encode('utf-8')
dataSwiss['Municipality'] = dataSwiss['Municipality'].str.replace(u"é", "e")
but I get the following error:
----> 2 dataSwiss['Municipality'] = dataSwiss['Municipality'].str.replace(u"é", "e")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
The data looks like:
dataSwiss.Municipality
0 Zürich
1 Zürich
2 Zürich
3 Zürich
4 Zürich
5 Zürich
6 Zürich
7 Zürich
I found the solution:
s = dataSwiss['Municipality']
res = s.str.decode('utf-8')
res = res.str.replace(u"é", "e")
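A minimal sketch of why the decode step helps, using plain Python 3 strings rather than the dataframe (an assumption on my part; the session in the question looks like Python 2, where the ASCII decode happened implicitly inside the replace call). Encoding 'Zürich' as UTF-8 turns 'ü' into two bytes, and the first of them is the 0xc3 at position 1 named in the error message:

```python
name = "Z\u00fcrich"        # 'Zürich'
raw = name.encode("utf-8")  # bytes, like the result of .str.encode('utf-8')

# 'ü' becomes the two bytes 0xc3 0xbc, so the byte at index 1 is 0xc3
first_non_ascii = raw[1]

# Treating those bytes as ASCII fails in the way the traceback reports
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding as UTF-8 restores text, after which replacing characters is safe
text = raw.decode("utf-8")
replaced = text.replace("\u00fc", "u")
```

Here ascii_error holds the UnicodeDecodeError and replaced is 'Zurich'.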
This is one way: normalize to NFKD, encode to ASCII bytes while ignoring anything that cannot be represented, then decode the bytes back to a string.
import pandas as pd
s = pd.Series(['hello', 'héllo', 'Zürich', 'Zurich'])
res = s.str.normalize('NFKD')\
.str.encode('ascii', errors='ignore')\
.str.decode('utf-8')
print(res)
0 hello
1 hello
2 Zurich
3 Zurich
dtype: object
pd.Series.str.normalize uses the unicodedata module. As per the docs:
The normal form KD (NFKD) will apply the compatibility decomposition, i.e. replace all compatibility characters with their equivalents.
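To make the decomposition concrete, here is a small sketch using only the standard unicodedata module (the sample character is illustrative, not taken from the dataframe). NFKD splits an accented letter into a base letter plus a combining mark, and encoding to ASCII with errors='ignore' then discards the mark:

```python
import unicodedata

decomposed = unicodedata.normalize("NFKD", "\u00fc")  # 'ü'

# 'ü' becomes 'u' followed by U+0308 COMBINING DIAERESIS
parts = [unicodedata.name(ch) for ch in decomposed]

# Encoding to ASCII with errors='ignore' drops the combining mark
ascii_only = decomposed.encode("ascii", errors="ignore").decode("ascii")

print(parts)
print(ascii_only)
```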
Try the unidecode module.
For example:
import unidecode
dataSwiss['Municipality'] = dataSwiss['Municipality'].apply(unidecode.unidecode)
Or:
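import unicodedata
def remove_accents(input_str):
    # Decompose accented characters into base letter + combining mark
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    # Drop everything non-ASCII; on Python 3 this is bytes, so append .decode('ascii') to get str
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii
dataSwiss['Municipality'] = dataSwiss['Municipality'].apply(remove_accents)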
Note: The function is from this link
Update as per comment (Python 2, where unicode is a built-in type):
dataSwiss['Municipality'] = dataSwiss['Municipality'].apply(unicode).apply(remove_accents)
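One caveat worth checking before relying on the NFKD approach: letters that have no decomposition are dropped entirely rather than replaced. The sketch below uses a hypothetical Python 3 variant, remove_accents_py3; the name, the .decode('ascii') step, and the sample words are my additions, not part of the answer above.

```python
import unicodedata

def remove_accents_py3(input_str):
    # Decompose, drop non-ASCII, and return str rather than bytes
    nfkd_form = unicodedata.normalize("NFKD", input_str)
    return nfkd_form.encode("ascii", "ignore").decode("ascii")

zurich = remove_accents_py3("Z\u00fcrich")   # 'Zürich'
geneve = remove_accents_py3("Gen\u00e8ve")   # 'Genève'
strasse = remove_accents_py3("Stra\u00dfe")  # 'Straße': 'ß' has no decomposition

print(zurich, geneve, strasse)
```

The last call returns 'Strae' because 'ß' is simply removed. That is the kind of input where a transliteration table such as the one in unidecode may give a more readable result.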