
How to replace values in multiple categoricals in a pandas DataFrame

I want to replace certain values in a dataframe containing multiple categoricals.

import pandas as pd

df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')

If I apply .replace on a single column, the result is as expected:

>>> df.s1.replace('a', 1)
0    1
1    b
2    c
Name: s1, dtype: object

If I apply the same operation to the whole dataframe, pandas raises an error (traceback abbreviated):

>>> df.replace('a', 1)
ValueError: Cannot setitem on a Categorical with a new category, set the categories first

During handling of the above exception, another exception occurred:
ValueError: Wrong number of dimensions

If the dataframe contains integers as categories, the following happens:

df = pd.DataFrame({'s1': [1, 2, 3], 's2': [1, 3, 4]}, dtype='category')

>>> df.replace(1, 3)
    s1  s2
0   3   3
1   2   3
2   3   4

But replacing 1 with 2 fails:

>>> df.replace(1, 2)
ValueError: Wrong number of dimensions

What am I missing?

tobiasraabe asked Feb 15 '18 12:02



2 Answers

Without digging into it, this looks like a bug to me.

My workaround
pd.DataFrame.apply with pd.Series.replace
This has the advantage that you don't need to change any types yourself.

df = pd.DataFrame({'s1': [1, 2, 3], 's2': [1, 3, 4]}, dtype='category')
df.apply(pd.Series.replace, to_replace=1, value=2)

  s1  s2
0  2   2
1  2   3
2  3   4

Or

df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
df.apply(pd.Series.replace, to_replace='a', value=1)

  s1 s2
0  1  1
1  b  c
2  c  d

@cᴏʟᴅsᴘᴇᴇᴅ's workaround

df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
df.applymap(str).replace('a', 1)

  s1 s2
0  1  1
1  b  c
2  c  d
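
If the columns should stay categorical after the replacement, one option is a round trip through object dtype. This is a minimal sketch and not part of the original answer: the name `result` is illustrative, and the new categories are simply inferred from whatever values remain after the replacement. Unlike converting everything to str, it leaves non-string values such as integers unchanged.

```python
import pandas as pd

df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')

# Drop the categorical dtype, replace on plain object columns,
# then convert back so each column gets fresh categories.
result = df.astype(object).replace('a', 1).astype('category')

print(result['s1'].tolist())  # [1, 'b', 'c']
print(result['s2'].tolist())  # [1, 'c', 'd']
print(result.dtypes)
```

The cost is that any custom category order or unused categories on the original columns are lost, since the categories are rebuilt from the data.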

piRSquared answered Oct 21 '22 01:10


The reason for this behavior is that each column has a different set of categories:

In [224]: df.s1.cat.categories
Out[224]: Index(['a', 'b', 'c'], dtype='object')

In [225]: df.s2.cat.categories
Out[225]: Index(['a', 'c', 'd'], dtype='object')

So if you replace with a value that is present in the categories of both columns, it works:

In [226]: df.replace('d','a')
Out[226]:
  s1 s2
0  a  a
1  b  c
2  c  a

As a solution, you might want to make your columns categorical manually, using:

pd.Categorical(..., categories=[...])

where categories contains all possible values across all columns.
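
One way to sketch this idea is with a single `CategoricalDtype` shared by every column. The names `all_values`, `shared` and `df_cat` are assumptions made for illustration. The last step uses `Series.cat.rename_categories` rather than `df.replace`, a swapped-in technique that keeps the columns categorical while changing a value.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']})

# Collect every value that occurs in any column.
all_values = sorted({v for col in df.columns for v in df[col]})

# One dtype shared by all columns, so every column has the same categories.
shared = CategoricalDtype(categories=all_values)
df_cat = df.astype(shared)

print(df_cat['s1'].cat.categories.tolist())  # ['a', 'b', 'c', 'd']
print(df_cat['s2'].cat.categories.tolist())  # ['a', 'b', 'c', 'd']

# Swapped-in technique (not df.replace): renaming a category changes the
# value everywhere in that column and keeps the column categorical.
for col in df_cat.columns:
    df_cat[col] = df_cat[col].cat.rename_categories({'a': 1})

print(df_cat['s1'].tolist())  # [1, 'b', 'c']
print(df_cat['s2'].tolist())  # [1, 'c', 'd']
```

Note that `rename_categories` only works when the new value is not already a category, because categories must stay unique.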

MaxU - stop WAR against UA answered Oct 21 '22 01:10