I will explain my problem statement:
Suppose I have train data and test data. I have NaN values in same columns for both train and test. Now my strategy for the nan imputation is this: Groupby some column and fill the nans with mean of that group. Example:
x_train = pd.DataFrame({
'Occupation': ['driver', 'driver', 'mechanic', 'teacher', 'mechanic', 'teacher',
'unemployed', 'driver', 'mechanic', 'teacher'],
'salary': [100, 150, 70, 300, 90, 250, 10, 90, 110, 350],
'expenditure': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120]})
Occupation salary expenditure
0 driver 100 20.0
1 driver 150 40.0
2 mechanic 70 10.0
3 teacher 300 100.0
4 mechanic 90 NaN
5 teacher 250 80.0
6 unemployed 10 0.0
7 driver 90 NaN
8 mechanic 110 40.0
9 teacher 350 120.0
For train data I can do this like this:
x_train['expenditure'] = x_train.groupby('Occupation')['expenditure'].transform(lambda x:x.fillna(x.mean())
But how do I do something like this for test data. Where the mean would be the training group's mean. I am trying to do this using for loop, but it's taking forever.
Create mean to Series:
mean = x_train.groupby('Occupation')['expenditure'].mean()
print (mean)
Occupation
driver 30.0
mechanic 25.0
teacher 100.0
unemployed 0.0
Name: expenditure, dtype: float64
And then replace missing values by Series.map and Series.fillna:
x_train['expenditure'] = x_train['expenditure'].fillna(x_train['Occupation'].map(mean))
print (x_train)
Occupation salary expenditure
0 driver 100 20.0
1 driver 150 40.0
2 mechanic 70 10.0
3 teacher 300 100.0
4 mechanic 90 25.0
5 teacher 250 80.0
6 unemployed 10 0.0
7 driver 90 30.0
8 mechanic 110 40.0
9 teacher 350 120.0
And same way use for test data:
x_test['expenditure'] = x_test['expenditure'].fillna(x_test['Occupation'].map(mean))
EDIT:
Solution for multiple columns - instead map use DataFrame.join:
x_train = pd.DataFrame({
'Occupation': ['driver', 'driver', 'mechanic', 'teacher', 'mechanic', 'teacher',
'unemployed', 'driver', 'mechanic', 'teacher'],
'salary': [100, 150, 70, 300, 90, 250, 10, 90, 110, 350],
'expenditure': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120],
'expenditure1': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120],
'col':list('aabbddeehh')})
mean = x_train.groupby('Occupation').mean()
print (mean)
salary expenditure expenditure1
Occupation
driver 113.333333 30.0 30.0
mechanic 90.000000 25.0 25.0
teacher 300.000000 100.0 100.0
unemployed 10.000000 0.0 0.0
x_train = x_train.fillna(x_train[['Occupation']].join(mean, on='Occupation'))
print (x_train)
Occupation salary expenditure expenditure1 col
0 driver 100 20.0 20.0 a
1 driver 150 40.0 40.0 a
2 mechanic 70 10.0 10.0 b
3 teacher 300 100.0 100.0 b
4 mechanic 90 25.0 25.0 d
5 teacher 250 80.0 80.0 d
6 unemployed 10 0.0 0.0 e
7 driver 90 30.0 30.0 e
8 mechanic 110 40.0 40.0 h
9 teacher 350 120.0 120.0 h
x_test = x_test.fillna(x_test[['Occupation']].join(mean, on='Occupation'))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With