Replacing special characters in pandas dataframe

Question

So, I have this huge DF which encoded in iso8859_15.

I have a few columns which contain names and places in Brazil, so some of them contain special characters such as "í" or "Ô".

I have the key to replace them in a dictionary {'í':'i', 'á':'a', ...}

I tried replacing it a couple of ways (below), but none of them worked.

df.replace(dictionary, regex=True, inplace=True) ###BOTH WITH AND WITHOUT REGEX AND REPLACE

Also:

df.udpate(pd.Series(dic))

None of them had the expected output, which would be for strings such as "NÍCOLAS" to become "NICOLAS".

Help?

randomir · Accepted Answer

The docs on pandas.DataFrame.replace says you have to provide a nested dictionary: the first level is the column name for which you have to provide a second dictionary with substitution pairs.

So, this should work:

>>> df=pd.DataFrame({'a': ['NÍCOLAS','asdč'], 'b': [3,4]})
>>> df
         a  b
0  NÍCOLAS  3
1     asdč  4

>>> df.replace({'a': {'č': 'c', 'Í': 'I'}}, regex=True)
         a  b
0  NICOLAS  3
1     asdc  4

Edit. Seems pandas also accepts non-nested translation dictionary. In that case, the problem is probably with character encoding, particularly if you use Python 2. Assuming your CSV load function decoded the file characters properly (as true Unicode code-points), then you should take care your translation/substitution dictionary is also defined with Unicode characters, like this:

dictionary = {u'í': 'i', u'á': 'a'}

If you have a definition like this (and using Python 2):

dictionary = {'í': 'i', 'á': 'a'}

then the actual keys in that dictionary are multibyte strings. Which bytes (characters) they are depends on the actual source file character encoding used, but presuming you use UTF-8, you'll get:

dictionary = {'\xc3\xa1': 'a', '\xc3\xad': 'i'}

And that would explain why pandas fails to replace those chars. So, be sure to use Unicode literals in Python 2: u'this is unicode string'.

On the other hand, in Python 3, all strings are Unicode strings, and you don't have to use the u prefix (in fact unicode type from Python 2 is renamed to str in Python 3, and the old str from Python 2 is now bytes in Python 3).

OverflowingTheGlass · Answer

replace works out of the box without specifying a specific column in Python 3.

Load Data:

df=pd.read_csv('test.csv', sep=',', low_memory=False, encoding='iso8859_15')
df

Result:

col1    col2
0   he  hello
1   Nícolas shárk
2   welcome yes

Create Dictionary:

dictionary = {'í':'i', 'á':'a'}

Replace:

df.replace(dictionary, regex=True, inplace=True)

Result:

 col1   col2
0   he  hello
1   Nicolas shark
2   welcome yes

Replacing special characters in pandas dataframe

Tags:

python

pandas

Raphael Hernandes

2 Answers

randomir

OverflowingTheGlass

Recent Activity

Donate For Us

Replacing special characters in pandas dataframe

Tags:

python

pandas

Raphael Hernandes

2 Answers

randomir

OverflowingTheGlass

Related questions

Recent Activity

Donate For Us