Filling NaN values in pandas using Train Data Statistics

Question

I will explain my problem statement:

Suppose I have train data and test data. I have NaN values in same columns for both train and test. Now my strategy for the nan imputation is this: Groupby some column and fill the nans with mean of that group. Example:

x_train = pd.DataFrame({
'Occupation': ['driver', 'driver', 'mechanic', 'teacher', 'mechanic', 'teacher',
    'unemployed', 'driver', 'mechanic', 'teacher'],
'salary': [100, 150, 70, 300, 90, 250, 10, 90, 110, 350],
'expenditure': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120]})

    Occupation  salary  expenditure
0   driver      100     20.0    
1   driver      150     40.0    
2   mechanic    70      10.0
3   teacher     300     100.0
4   mechanic    90      NaN 
5   teacher     250     80.0
6   unemployed  10      0.0  
7   driver      90      NaN   
8   mechanic    110     40.0    
9   teacher     350     120.0

For train data I can do this like this:

x_train['expenditure'] = x_train.groupby('Occupation')['expenditure'].transform(lambda x:x.fillna(x.mean())

But how do I do something like this for test data. Where the mean would be the training group's mean. I am trying to do this using for loop, but it's taking forever.

jezrael · Accepted Answer

Create mean to Series:

mean = x_train.groupby('Occupation')['expenditure'].mean()
print (mean)
Occupation
driver         30.0
mechanic       25.0
teacher       100.0
unemployed      0.0
Name: expenditure, dtype: float64

And then replace missing values by Series.map and Series.fillna:

x_train['expenditure'] = x_train['expenditure'].fillna(x_train['Occupation'].map(mean))
print (x_train)
   Occupation  salary  expenditure
0      driver     100         20.0
1      driver     150         40.0
2    mechanic      70         10.0
3     teacher     300        100.0
4    mechanic      90         25.0
5     teacher     250         80.0
6  unemployed      10          0.0
7      driver      90         30.0
8    mechanic     110         40.0
9     teacher     350        120.0

And same way use for test data:

x_test['expenditure'] = x_test['expenditure'].fillna(x_test['Occupation'].map(mean))

EDIT:

Solution for multiple columns - instead map use DataFrame.join:

x_train = pd.DataFrame({
'Occupation': ['driver', 'driver', 'mechanic', 'teacher', 'mechanic', 'teacher',
    'unemployed', 'driver', 'mechanic', 'teacher'],
'salary': [100, 150, 70, 300, 90, 250, 10, 90, 110, 350],
'expenditure': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120],
'expenditure1': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120],
'col':list('aabbddeehh')})
    
mean = x_train.groupby('Occupation').mean()
print (mean)
                salary  expenditure  expenditure1
Occupation                                       
driver      113.333333         30.0          30.0
mechanic     90.000000         25.0          25.0
teacher     300.000000        100.0         100.0
unemployed   10.000000          0.0           0.0

x_train = x_train.fillna(x_train[['Occupation']].join(mean, on='Occupation'))
print (x_train)
   Occupation  salary  expenditure  expenditure1 col
0      driver     100         20.0          20.0   a
1      driver     150         40.0          40.0   a
2    mechanic      70         10.0          10.0   b
3     teacher     300        100.0         100.0   b
4    mechanic      90         25.0          25.0   d
5     teacher     250         80.0          80.0   d
6  unemployed      10          0.0           0.0   e
7      driver      90         30.0          30.0   e
8    mechanic     110         40.0          40.0   h
9     teacher     350        120.0         120.0   h

x_test = x_test.fillna(x_test[['Occupation']].join(mean, on='Occupation'))

Filling NaN values in pandas using Train Data Statistics

Tags:

pandas

dataframe

group-by

Rishabh Rao

1 Answers

jezrael

Recent Activity

Donate For Us

Filling NaN values in pandas using Train Data Statistics

Tags:

pandas

dataframe

group-by

Rishabh Rao

1 Answers

jezrael

Related questions

Recent Activity

Donate For Us