I have an incomplete dataframe, incomplete_df, as below. I want to impute the missing amounts with the average amount of the corresponding id. If the average for that specific id is itself NaN (see id=4), I want to use the overall average.
Below are the example data and my highly inefficient solution:
import pandas as pd
import numpy as np
incomplete_df = pd.DataFrame({'id': [1,2,3,2,2,3,1,1,1,2,4],
                              'type': ['one', 'one', 'two', 'three', 'two', 'three', 'one', 'two', 'one', 'three','one'],
                              'amount': [345,928,np.nan,645,113,942,np.nan,539,np.nan,814,np.nan]  # np.nan: the np.NAN alias was removed in NumPy 2.0
                             }, columns=['id','type','amount'])
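# Forrest Gump Solution
# Assumed definition: 'means' was not defined in the original post. Per-id mean amount;
# ids whose amounts are all NaN (here id=4) are dropped so they take the fallback branch.
means = incomplete_df.groupby('id')[['amount']].mean().dropna()
for idx in incomplete_df.index[np.isnan(incomplete_df.amount)]:  # loop through all rows with amount = NaN
    cur_id = incomplete_df.loc[idx, 'id']
    if cur_id in means.index:
        incomplete_df.loc[idx, 'amount'] = means.loc[cur_id, 'amount']  # average amount of that specific id
    else:
        incomplete_df.loc[idx, 'amount'] = np.mean(means.amount)  # average of the per-id averages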
What is the fastest and the most pythonic/pandonic way to achieve this?
Pandas treats None and NaN as essentially interchangeable markers for missing or null values. To support this convention, it provides several functions for detecting, removing, and replacing null values in a DataFrame, including isnull() and notnull().
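As a small illustration of the detection functions mentioned above, here is a sketch on a made-up Series (the values and variable names are assumptions for the example, not taken from the question):

```python
import numpy as np
import pandas as pd

# Toy data mixing None and NaN; in a float Series both end up as NaN.
s = pd.Series([1.0, None, np.nan, 4.0])

missing_mask = s.isnull()    # True where the value is missing
present_mask = s.notnull()   # element-wise opposite of isnull()

n_missing = int(missing_mask.sum())  # count of missing entries
print(n_missing)
```

The boolean masks can be used directly for indexing, for example s[present_mask] keeps only the observed values.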
Pandas DataFrame fillna() method: the fillna() method replaces null values with a specified value. It returns a new DataFrame unless the inplace parameter is set to True, in which case it modifies the original DataFrame and returns None.
Using DataFrame.fillna() from the pandas library, we can easily replace NaN values in a data frame. Procedure: compute the mean of the column with mean(), then pass that value to fillna() to replace every NaN in that column.
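The procedure just described might look like the following minimal sketch. The frame df_demo and its values are invented for illustration; only mean() and fillna() come from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical one-column frame with a single missing value.
df_demo = pd.DataFrame({"amount": [10.0, np.nan, 30.0]})

# mean() skips NaN by default, so this is the mean of 10.0 and 30.0.
col_mean = df_demo["amount"].mean()

# Passing a dict fills only the named column; a new DataFrame is returned
# and df_demo itself is left unchanged (inplace defaults to False).
filled = df_demo.fillna({"amount": col_mean})
print(filled)
```

Assigning the result to a new name, rather than using inplace=True, keeps the original data available for comparison.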
Disclaimer: I'm not really interested in the fastest solution but the most pandorable.
Here, I think that would be something like:
>>> df["amount"] = df["amount"].fillna(df.groupby("id")["amount"].transform("mean"))
>>> df["amount"] = df["amount"].fillna(df["amount"].mean())
which produces
>>> df
    id   type  amount
0    1    one   345.0
1    2    one   928.0
2    3    two   942.0
3    2  three   645.0
4    2    two   113.0
5    3  three   942.0
6    1    one   442.0
7    1    two   539.0
8    1    one   442.0
9    2  three   814.0
10   4    one   615.2

[11 rows x 3 columns]
There are lots of obvious tweaks depending upon exactly how you want the chained imputation process to go.
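One possible tweak, sketched here as an assumption rather than as part of the answer: compute the overall fallback from the originally observed amounts only, so that values imputed in the per-id step do not influence it. The helper name impute_amount is hypothetical, and the data is the example frame from the question.

```python
import numpy as np
import pandas as pd

def impute_amount(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill missing 'amount' with the per-id mean, then with the overall mean
    of the originally observed values. Returns a new frame."""
    out = frame.copy()
    # Overall mean taken before any imputation (NaN is skipped by default).
    overall_mean = out["amount"].mean()
    # Per-id mean broadcast back to the original row index;
    # it stays NaN for ids whose amounts are all missing.
    id_mean = out.groupby("id")["amount"].transform("mean")
    out["amount"] = out["amount"].fillna(id_mean).fillna(overall_mean)
    return out

df = pd.DataFrame({
    "id": [1, 2, 3, 2, 2, 3, 1, 1, 1, 2, 4],
    "type": ["one", "one", "two", "three", "two", "three",
             "one", "two", "one", "three", "one"],
    "amount": [345, 928, np.nan, 645, 113, 942, np.nan, 539, np.nan, 814, np.nan],
}, columns=["id", "type", "amount"])

result = impute_amount(df)
print(result)
```

With this variant the id=4 row receives the mean of the seven observed amounts instead of the mean taken after the per-id fill, so its value differs slightly from the output shown above. Which of the two counts as "the overall average" is a modelling choice.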