I stumbled upon weird and inconsistent behavior in pandas' replace
function when using it to swap two values in a column. When swapping integers in a column, we have
df = pd.DataFrame({'A': [0, 1]})
df.A.replace({0: 1, 1: 0})
This yields the result:
df
A
1
0
However, when using the same commands for string values
df = pd.DataFrame({'B': ['a', 'b']})
df.B.replace({'a': 'b', 'b': 'a'})
We get
df
B
'a'
'a'
Can anyone explain this difference in behavior to me, or point me to a page in the docs that deals with inconsistencies between integers and strings in pandas?
You can replace values in all or selected columns of a pandas DataFrame based on a condition by using the DataFrame.loc[] property. loc[] accesses a group of rows and columns by label(s) or a boolean array, and it can be used both to read and to modify the values of a DataFrame.
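A minimal sketch of a conditional replacement with loc. The frame, the condition and the replacement value 'z' are made up for illustration and are not taken from the question:

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({'B': ['a', 'b', 'a']})

# The boolean mask selects the rows; the column label selects where to write.
df.loc[df['B'] == 'a', 'B'] = 'z'

result = df['B'].tolist()
print(result)
```

Because the mask is evaluated once before any assignment happens, rows are not matched a second time by their new value.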
The DataFrame.replace() function replaces values (one value with another) across all columns. It takes to_replace, value, inplace, limit, regex and method as parameters and returns a new DataFrame. When inplace=True is used, it modifies the existing DataFrame object and returns None.
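As a small sketch of the "returns a new DataFrame" point, the example below (with an arbitrary replacement value of 10 chosen for illustration) shows that the original frame is left untouched unless the result is assigned back:

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1]})

# Without inplace=True, replace returns a new object and leaves df untouched.
replaced = df.replace({0: 10})

original_values = df['A'].tolist()
replaced_values = replaced['A'].tolist()
print(original_values, replaced_values)
```

This also applies to the snippets in the question: calling df.A.replace(...) without assigning the result does not change df itself.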
Yup, this is definitely a bug, so I've opened a new issue - GH20656.
It looks like pandas applies the replacements successively. It makes the first replacement, causing "a" to be replaced with "b", and then the second, causing both "b"s to be replaced by "a".
In summary, what you see is equivalent to
df.B.replace('a', 'b').replace('b', 'a')
0 a
1 a
Name: B, dtype: object
Which is definitely not what should be happening.
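To make the difference concrete without depending on any particular pandas version, here is a plain-Python sketch of the two strategies. It only illustrates the idea of successive versus single-pass replacement and is not pandas' actual implementation:

```python
mapping = {'a': 'b', 'b': 'a'}
values = ['a', 'b']

# Successive application: each pair is applied to the output of the previous one.
sequential = list(values)
for old, new in mapping.items():
    sequential = [new if v == old else v for v in sequential]

# Single pass: each original value is looked up exactly once.
single_pass = [mapping.get(v, v) for v in values]

print(sequential)   # ['a', 'a']
print(single_pass)  # ['b', 'a']
```

The single-pass result is what one would expect from a swap; the successive result matches the output shown above.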
There is a workaround using str.replace with a lambda callback.
m = {'a': 'b', 'b': 'a'}
# regex=True is needed on pandas 2.0+, where str.replace matches literally by default
df.B.str.replace('|'.join(m.keys()), lambda x: m[x.group()], regex=True)
0 b
1 a
Name: B, dtype: object
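Another option that avoids regular expressions altogether is Series.map with a per-element dictionary lookup. This is offered as an alternative sketch, not as the method from the answer above, and the extra 'c' row is added only to show that unmapped values are kept:

```python
import pandas as pd

df = pd.DataFrame({'B': ['a', 'b', 'c']})
m = {'a': 'b', 'b': 'a'}

# Each element is looked up once; values missing from m are kept unchanged.
swapped = df['B'].map(lambda v: m.get(v, v))

result = swapped.tolist()
print(result)
```

Passing the dictionary directly to map would also swap the values, but entries missing from the dictionary would then become NaN, which is why the sketch uses m.get(v, v).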