
Python-pandas Replace NA with the median or mean of a group in dataframe

Suppose we have a df:

   A       B
   apple   1.0
   apple   2.0
   apple   NA
   orange  NA
   orange  7.0
   melon   14.0
   melon   NA
   melon   15.0
   melon   16.0

To replace the NA values, we can use df["B"].fillna(df["B"].median()), but that fills every NA with the median of the whole "B" column.

Is there a way to fill each NA with the median of its own group in "A", as shown below?

    A       B
   apple   1.0
   apple   2.0
   apple   **1.5**
   orange  **7.0**
   orange  7.0
   melon   14.0
   melon   **15.0**
   melon   15.0
   melon   16.0

Thanks!

Robin1988 asked Nov 06 '15 18:11



2 Answers

In pandas, you can use groupby with transform to compute each row's group median, aligned to the original index, and pass the result to fillna:

>>> med = df.groupby('A')['B'].transform('median')
>>> df['B'].fillna(med)
0     1.0
1     2.0
2     1.5
3     7.0
4     7.0
5    14.0
6    15.0
7    15.0
8    16.0
Name: B, dtype: float64
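
Note that fillna returns a new Series rather than changing df, so the result has to be assigned back. A self-contained sketch of the same approach follows. The helper name fill_by_group and its parameters are made up for illustration and are not part of pandas; np.nan stands in for the NA values in the question.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question; np.nan plays the role of NA.
df = pd.DataFrame({
    "A": ["apple", "apple", "apple", "orange", "orange",
          "melon", "melon", "melon", "melon"],
    "B": [1.0, 2.0, np.nan, np.nan, 7.0, 14.0, np.nan, 15.0, 16.0],
})


def fill_by_group(frame, key, col, how="median"):
    """Hypothetical helper: return `col` with NaNs replaced by the
    per-`key` statistic named by `how` (e.g. "median" or "mean")."""
    # transform returns a Series aligned to frame's index, holding each
    # row's group statistic (NaNs are skipped when computing it).
    per_group = frame.groupby(key)[col].transform(how)
    return frame[col].fillna(per_group)


# Assign back, since fillna does not modify df in place.
df["B"] = fill_by_group(df, "A", "B", how="median")
filled = df["B"].tolist()
print(df)
```

Passing how="mean" gives the group mean instead. A group whose values are all NaN has no median to fill with, so its rows stay NaN.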
behzad.nouri answered Oct 13 '22 01:10



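In R, you can use na.aggregate from zoo together with data.table to replace each NA with the mean of its group. Convert the 'data.frame' to a 'data.table' with setDT(df), group by 'A', and apply na.aggregate to 'B'. na.aggregate fills with the mean by default; pass FUN = median to use the median instead (the two coincide for this data).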

library(zoo)
library(data.table)
setDT(df)[,  B:= na.aggregate(B), A]
df
#      A    B
#1:  apple  1.0
#2:  apple  2.0
#3:  apple  1.5
#4: orange  7.0
#5: orange  7.0
#6:  melon 14.0
#7:  melon 15.0
#8:  melon 15.0
#9:  melon 16.0
akrun answered Oct 13 '22 00:10
