I have a dataframe with two features: gps_height (numeric) and region (categorical).
The gps_height column contains a lot of 0 values, which are missing values in this case. I want to fill the 0 values with the mean of the corresponding region.
My reasoning is as follows: 1. Drop the zero values and take the mean of gps_height, grouped by region:
df[df.gps_height !=0].groupby(['region']).mean()
But how do I replace the zero values in my dataframe with those mean values?
Sample data:
   gps_height  region
0        1390  Iringa
1        1400    Mara
2           0  Iringa
3         250  Iringa
...
Use:
import numpy as np
import pandas as pd
df = pd.DataFrame({'region': list('aaabbbccc'),
                   'gps_height': [2, 3, 0, 3, 4, 5, 1, 0, 0]})
print(df)
region gps_height
0 a 2
1 a 3
2 a 0
3 b 3
4 b 4
5 b 5
6 c 1
7 c 0
8 c 0
Replace 0 with missing values, then fill the NaNs using fillna with the per-group means computed by GroupBy.transform:
# Treat 0 as missing, then fill each gap with the mean of its region.
df['gps_height'] = df['gps_height'].replace(0, np.nan)
df['gps_height'] = df['gps_height'].fillna(df.groupby('region')['gps_height'].transform('mean'))
print(df)
region gps_height
0 a 2.0
1 a 3.0
2 a 2.5
3 b 3.0
4 b 4.0
5 b 5.0
6 c 1.0
7 c 1.0
8 c 1.0
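One edge case worth keeping in mind: if every value in a region is 0, that region's mean is itself NaN, so the gaps stay unfilled. A minimal sketch of one possible fallback follows, assuming it is acceptable to use the overall mean in that situation. The sample data and the fallback policy are assumptions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data: region 'c' has no valid (non-zero) reading at all.
df = pd.DataFrame({'region': list('aabbcc'),
                   'gps_height': [2, 0, 4, 6, 0, 0]})

# Treat 0 as missing; the result is a float Series.
height = df['gps_height'].replace(0, np.nan)

# Per-region mean, broadcast back to the original row order.
# Region 'c' has only NaN values, so its mean is NaN as well.
region_mean = height.groupby(df['region']).transform('mean')

# Fill with the region mean first, then fall back to the overall mean
# (an assumed policy) for rows whose region mean is NaN.
df['gps_height'] = height.fillna(region_mean).fillna(height.mean())
print(df)
```

Whether the overall mean is a sensible substitute depends on the data; leaving those rows as NaN may be the more honest choice.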
Or filter out the 0 values, aggregate the means per region, and map them onto the rows that contain 0:
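# m is True for rows with a valid (non-zero) height
m = df['gps_height'] != 0
# mean of the valid heights per region, indexed by region
s = df[m].groupby('region')['gps_height'].mean()
# look up each zero row's region mean and assign it
df.loc[~m, 'gps_height'] = df['region'].map(s)
# alternative
# df['gps_height'] = np.where(~m, df['region'].map(s), df['gps_height'])
print(df)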
region gps_height
0 a 2.0
1 a 3.0
2 a 2.5
3 b 3.0
4 b 4.0
5 b 5.0
6 c 1.0
7 c 1.0
8 c 1.0
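If the same fill is needed for several columns, the logic can be wrapped in a small helper. This is only a sketch: the function name fill_zeros_with_group_mean and its parameters are hypothetical, and it uses Series.where to turn zeros into NaN instead of replace.

```python
import pandas as pd

def fill_zeros_with_group_mean(df, value_col, group_col):
    """Return value_col with zeros replaced by the mean of the
    non-zero values in the same group (hypothetical helper)."""
    # Keep non-zero values; zeros become NaN (the result is float).
    valid = df[value_col].where(df[value_col] != 0)
    # Group means of the valid values, aligned to the original rows.
    means = valid.groupby(df[group_col]).transform('mean')
    return valid.fillna(means)

df = pd.DataFrame({'region': list('aaabbbccc'),
                   'gps_height': [2, 3, 0, 3, 4, 5, 1, 0, 0]})
df['gps_height'] = fill_zeros_with_group_mean(df, 'gps_height', 'region')
print(df)
```

The helper returns a new Series rather than modifying the frame in place, so the caller decides whether to overwrite the original column.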