So, I have this huge DF which is encoded in iso8859_15.
I have a few columns which contain names and places in Brazil, so some of them contain special characters such as "í" or "Ô".
I have the key to replace them in a dictionary {'í':'i', 'á':'a', ...}
I tried replacing it a couple of ways (below), but none of them worked.
df.replace(dictionary, regex=True, inplace=True) ###BOTH WITH AND WITHOUT REGEX AND REPLACE
Also:
df.update(pd.Series(dic))
None of them had the expected output, which would be for strings such as "NÍCOLAS" to become "NICOLAS".
Help?
The docs on pandas.DataFrame.replace say you have to provide a nested dictionary: the first level is the column name, and for each column you provide a second dictionary with the substitution pairs.
So, this should work:
>>> df=pd.DataFrame({'a': ['NÍCOLAS','asdč'], 'b': [3,4]})
>>> df
         a  b
0  NÍCOLAS  3
1     asdč  4
>>> df.replace({'a': {'č': 'c', 'Í': 'I'}}, regex=True)
         a  b
0  NICOLAS  3
1     asdc  4
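Note that the substitution above uses the uppercase 'Í'. A dictionary that only has lowercase keys (like the one in the question) would leave "NÍCOLAS" untouched. Here is a minimal sketch of adding the uppercase counterparts, using plain Python strings and str.translate rather than pandas; lower_map and its 'ô' entry are assumptions for illustration:

```python
# Hypothetical lowercase-only mapping, similar to the one in the question.
lower_map = {'í': 'i', 'á': 'a', 'ô': 'o'}

# Add the uppercase counterpart of every entry ('Í' -> 'I', and so on).
full_map = dict(lower_map)
full_map.update({k.upper(): v.upper() for k, v in lower_map.items()})

# str.maketrans accepts a dict whose keys are single characters.
table = str.maketrans(full_map)

print('NÍCOLAS'.translate(str.maketrans(lower_map)))  # unchanged: no key for 'Í'
print('NÍCOLAS'.translate(table))                     # uppercase keys now match
```

The resulting full_map should also be usable as the inner dictionary passed to df.replace, assuming the column really holds Unicode strings.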
Edit: It seems pandas also accepts a non-nested translation dictionary. In that case, the problem is probably with character encoding, particularly if you use Python 2. Assuming your CSV load function decoded the file characters properly (as true Unicode code points), you should make sure your translation/substitution dictionary is also defined with Unicode characters, like this:
dictionary = {u'í': 'i', u'á': 'a'}
If you have a definition like this (and using Python 2):
dictionary = {'í': 'i', 'á': 'a'}
then the actual keys in that dictionary are byte strings. Which bytes they contain depends on the character encoding of the source file, but presuming you use UTF-8, you'll get:
dictionary = {'\xc3\xa1': 'a', '\xc3\xad': 'i'}
And that would explain why pandas fails to replace those chars. So, be sure to use Unicode literals in Python 2: u'this is unicode string'.
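The mismatch can be reproduced in Python 3 by encoding the key explicitly. This is only a sketch of the idea: the variable names are made up, and it assumes the dictionary source file was UTF-8 while the data was decoded from iso8859_15:

```python
# The character as a proper Unicode string (one code point).
key_unicode = 'í'

# What a Python 2 'í' literal would hold in a UTF-8 source file: two bytes.
key_bytes = key_unicode.encode('utf-8')
print(key_bytes)                         # b'\xc3\xad'
print(len(key_unicode), len(key_bytes))  # 1 2

# A name as it arrives after decoding an iso8859_15 file correctly.
name = b'N\xedcolas'.decode('iso8859_15')

# The Unicode key is found in the decoded name...
print(key_unicode in name)

# ...but the two UTF-8 bytes, read as two separate characters, are not.
mis_decoded = key_bytes.decode('iso8859_15')
print(mis_decoded in name)
```

The first membership test succeeds and the second fails, which is the same reason a byte-string key never matches the decoded column values.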
On the other hand, in Python 3, all strings are Unicode strings, and you don't have to use the u prefix (in fact, the unicode type from Python 2 is renamed to str in Python 3, and the old str from Python 2 is now bytes in Python 3).
In Python 3, replace works out of the box without specifying a column.
Load Data:
df=pd.read_csv('test.csv', sep=',', low_memory=False, encoding='iso8859_15')
df
Result:
      col1   col2
0       he  hello
1  Nícolas  shárk
2  welcome    yes
Create Dictionary:
dictionary = {'í':'i', 'á':'a'}
Replace:
df.replace(dictionary, regex=True, inplace=True)
Result:
      col1   col2
0       he  hello
1  Nicolas  shark
2  welcome    yes