I want to replace certain values in a dataframe containing multiple categoricals.
df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
If I apply .replace
on a single column, the result is as expected:
>>> df.s1.replace('a', 1)
0 1
1 b
2 c
Name: s1, dtype: object
If I apply the same operation to the whole dataframe, an error is shown (short version):
>>> df.replace('a', 1)
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
During handling of the above exception, another exception occurred:
ValueError: Wrong number of dimensions
If the dataframe contains integers as categories, the following happens:
df = pd.DataFrame({'s1': [1, 2, 3], 's2': [1, 3, 4]}, dtype='category')
>>> df.replace(1, 3)
s1 s2
0 3 3
1 2 3
2 3 4
But,
>>> df.replace(1, 2)
ValueError: Wrong number of dimensions
What am I missing?
Method 1: Using replace() method Replacing is one of the methods to convert categorical terms into numeric. For example, We will take a dataset of people's salaries based on their level of education. This is an ordinal type of categorical variable. We will convert their education levels into numeric terms.
You can replace values of all or selected columns based on the condition of pandas DataFrame by using DataFrame. loc[ ] property. The loc[] is used to access a group of rows and columns by label(s) or a boolean array. It can access and can also manipulate the values of pandas DataFrame.
DataFrame. replace() function is used to replace values in column (one value with another value on all columns). This method takes to_replace, value, inplace, limit, regex and method as parameters and returns a new DataFrame. When inplace=True is used, it replaces on existing DataFrame object and returns None value.
Without digging, that seems to be buggy to me.
My Work Aroundpd.DataFrame.apply
with pd.Series.replace
This has the advantage that you don't need to mess with changing any types.
df = pd.DataFrame({'s1': [1, 2, 3], 's2': [1, 3, 4]}, dtype='category')
df.apply(pd.Series.replace, to_replace=1, value=2)
s1 s2
0 2 2
1 2 3
2 3 4
Or
df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
df.apply(pd.Series.replace, to_replace='a', value=1)
s1 s2
0 1 1
1 b c
2 c d
@cᴏʟᴅsᴘᴇᴇᴅ's Work Around
df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
df.applymap(str).replace('a', 1)
s1 s2
0 1 1
1 b c
2 c d
The reason for such behavior is different set of categorical values for each column:
In [224]: df.s1.cat.categories
Out[224]: Index(['a', 'b', 'c'], dtype='object')
In [225]: df.s2.cat.categories
Out[225]: Index(['a', 'c', 'd'], dtype='object')
so if you will replace to a value that is in both categories it'll work:
In [226]: df.replace('d','a')
Out[226]:
s1 s2
0 a a
1 b c
2 c a
As a solution you might want to make your columns categorical manually, using:
pd.Categorical(..., categories=[...])
where categories would have all possible values for all columns...
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With