Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

python pandas dataframe : fill nans with a conditional mean

I have the following dataframe:

import numpy as np 
import pandas as pd
df = pd.DataFrame(data={'Cat' : ['A', 'A', 'A','B', 'B', 'A', 'B'],
                        'Vals' : [1, 2, 3, 4, 5, np.nan, np.nan]})

    Cat   Vals
0   A     1
1   A     2
2   A     3
3   B     4
4   B     5
5   A   NaN
6   B   NaN

And I want indexes 5 and 6 to be filled with the conditional mean of 'Vals' based on the 'Cat' column, namely 2 and 4.5

The following code works fine:

means = df.groupby('Cat').Vals.mean()
for i in df[df.Vals.isnull()].index:
    df.loc[i, 'Vals'] = means[df.loc[i].Cat]

    Cat   Vals
0   A     1
1   A     2
2   A     3
3   B     4
4   B     5
5   A     2
6   B   4.5

But I'm looking for something nicer, like

df.Vals.fillna(df.Vals.mean(Conditionally to column 'Cat'))

Edit: I found this, which is one line shorter, but I'm still not happy with it:

means = df.groupby('Cat').Vals.mean()
df.Vals = df.apply(lambda x: means[x.Cat] if pd.isnull(x.Vals) else x.Vals, axis=1)
like image 298
Niourf Avatar asked Oct 31 '15 22:10

Niourf


1 Answers

We wish to "associate" the Cat values with the missing NaN locations. In Pandas such associations are always done via the index. So it is natural to set Cat as the index:

df = df.set_index(['Cat'])

Once this is done, then fillna works as desired:

df['Vals'] = df['Vals'].fillna(means)

To return Cat to a column, you could then of course use reset_index:

df = df.reset_index()

import pandas as pd
import numpy as np
df = pd.DataFrame(
    {'Cat' : ['A', 'A', 'A','B', 'B', 'A', 'B'], 
     'Vals' : [1, 2, 3, 4, 5, np.nan, np.nan]})

means = df.groupby(['Cat'])['Vals'].mean()
df = df.set_index(['Cat'])
df['Vals'] = df['Vals'].fillna(means)
df = df.reset_index()
print(df)

yields

  Cat  Vals
0   A   1.0
1   A   2.0
2   A   3.0
3   B   4.0
4   B   5.0
5   A   2.0
6   B   4.5
like image 61
unutbu Avatar answered Oct 22 '22 01:10

unutbu