
Replacing special characters in pandas dataframe

Tags: python, pandas

So, I have this huge DataFrame which is encoded in iso8859_15.

I have a few columns which contain names and places in Brazil, so some of them contain special characters such as "í" or "Ô".

I have the mapping to replace them in a dictionary {'í':'i', 'á':'a', ...}

I tried replacing it a couple of ways (below), but none of them worked.

df.replace(dictionary, regex=True, inplace=True)  # tried both with and without regex and inplace

Also:

df.update(pd.Series(dictionary))

None of them had the expected output, which would be for strings such as "NÍCOLAS" to become "NICOLAS".

Help?

Raphael Hernandes asked Aug 09 '17 16:08



2 Answers

The docs for pandas.DataFrame.replace say you have to provide a nested dictionary: the first level is the column name, and for each column you provide a second dictionary with the substitution pairs.

So, this should work:

>>> import pandas as pd
>>> df = pd.DataFrame({'a': ['NÍCOLAS', 'asdč'], 'b': [3, 4]})
>>> df
         a  b
0  NÍCOLAS  3
1     asdč  4

>>> df.replace({'a': {'č': 'c', 'Í': 'I'}}, regex=True)
         a  b
0  NICOLAS  3
1     asdc  4

Edit: It seems pandas also accepts a non-nested translation dictionary. In that case, the problem is probably the character encoding, particularly if you use Python 2. Assuming your CSV load function decoded the file properly (into true Unicode code points), you should make sure your translation/substitution dictionary is also defined with Unicode characters, like this:

dictionary = {u'í': 'i', u'á': 'a'}

If you have a definition like this (and using Python 2):

dictionary = {'í': 'i', 'á': 'a'}

then the actual keys in that dictionary are byte strings. Which bytes they contain depends on the character encoding of the source file, but presuming you use UTF-8, you'll get:

dictionary = {'\xc3\xa1': 'a', '\xc3\xad': 'i'}

And that would explain why pandas fails to replace those chars. So, be sure to use Unicode literals in Python 2: u'this is unicode string'.

On the other hand, in Python 3 all strings are Unicode strings, so you don't need the u prefix (in fact, the unicode type from Python 2 was renamed to str in Python 3, and the old Python 2 str is now bytes).
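
To make the encoding point concrete, here is a small Python 3 sketch using only the standard library. It is an illustration, not the asker's actual data: the variable names are made up, and the byte string for "NÍCOLAS" is constructed by hand to stand in for what an iso8859_15 file would contain.

```python
# Illustrative sketch (Python 3); all variable names are hypothetical.
key = "í"

# The same character has different byte representations per encoding.
utf8_bytes = key.encode("utf-8")         # two bytes in UTF-8
latin9_bytes = key.encode("iso8859_15")  # a single byte in ISO-8859-15

# A dictionary keyed by raw bytes does not match the text character,
# which mirrors the Python 2 situation described above.
bytes_keyed = {utf8_bytes: "i"}
bytes_key_found = key in bytes_keyed

# Decoding with the file's real encoding recovers the intended text.
decoded_name = b"N\xcdCOLAS".decode("iso8859_15")

# Decoding UTF-8 bytes with the wrong codec gives two unrelated characters.
mojibake = utf8_bytes.decode("iso8859_15")
```

Here utf8_bytes is the same b'\xc3\xad' sequence shown in the dictionary above, bytes_key_found is False, and decoded_name is the proper text string 'NÍCOLAS' that a replacement dictionary with Unicode keys can match.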

randomir answered Sep 19 '22 20:09



In Python 3, replace works out of the box without specifying a column.

Load Data:

import pandas as pd

df = pd.read_csv('test.csv', sep=',', low_memory=False, encoding='iso8859_15')
df

Result:

      col1   col2
0       he  hello
1  Nícolas  shárk
2  welcome    yes

Create Dictionary:

dictionary = {'í':'i', 'á':'a'}

Replace:

df.replace(dictionary, regex=True, inplace=True)

Result:

      col1   col2
0       he  hello
1  Nicolas  shark
2  welcome    yes
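
One more thing worth checking, offered as a hedged sketch rather than a confirmed diagnosis: replace is case-sensitive, so a dictionary that only has lowercase keys such as 'í' will not touch the 'Í' in "NÍCOLAS". The example below uses a small in-memory frame with made-up data (no CSV needed) and extends the dictionary with uppercase pairs. The names lower_only and both_cases are chosen just for illustration.

```python
import pandas as pd

# Made-up sample data standing in for the real file.
df = pd.DataFrame({
    "col1": ["he", "Nícolas", "NÍCOLAS"],
    "col2": ["hello", "shárk", "yes"],
})

# Lowercase-only mapping, as in the question.
lower_only = {"í": "i", "á": "a"}
result_lower = df.replace(lower_only, regex=True)

# Add the uppercase counterpart of every pair ('Í' -> 'I', 'Á' -> 'A').
both_cases = dict(lower_only)
both_cases.update({k.upper(): v.upper() for k, v in lower_only.items()})
result_both = df.replace(both_cases, regex=True)

print(result_lower)
print(result_both)
```

With the lowercase-only mapping, "Nícolas" becomes "Nicolas" but "NÍCOLAS" is left alone; with both cases in the mapping, it becomes "NICOLAS".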
OverflowingTheGlass answered Sep 17 '22 20:09
