
Computing MAD (mean absolute deviation) with GroupBy in Pandas

I have a dataframe:

Type Name Cost
  A   X    545
  B   Y    789
  C   Z    477
  D   X    640
  C   X    435
  B   Z    335
  A   X    850
  B   Y    152

My dataframe contains all combinations of Type ['A','B','C','D'] and Name ['X','Y','Z']. I used the groupby method to get stats for each specific combination, such as A-X, A-Y, A-Z. Here's some code:

import numpy as np
import pandas as pd
df = pd.DataFrame({'Type':['A','B','C','D','C','B','A','B'], 'Name':['X','Y','Z','X','X','Z','X','Y'], 'Cost':[545,789,477,640,435,335,850,152]})
df.groupby(['Name','Type'])['Cost'].agg(['mean','std'])
# need to use mad instead of std

I need to eliminate the observations that are more than 3 MADs away from the mean; something like:

test = df[np.abs(df.Cost-df.Cost.mean())<=(3*df.Cost.mad())]

I am confused by this, as df.Cost.mad() returns the MAD of Cost over the entire data rather than for a specific Type-Name category. How can I combine both?

Hypothetical Ninja asked Apr 24 '15 11:04


1 Answer

You can use groupby with transform to build new series, aligned row by row with the original dataframe, that can then be used to filter your data.

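groups = df.groupby(['Name','Type'])
# per-group mean absolute deviation, broadcast back to every row
# (Series.mad() was removed in pandas 2.0, so this line needs an older version)
mad = groups['Cost'].transform(lambda x: x.mad())
# per-row absolute distance from the group mean
dif = groups['Cost'].transform(lambda x: np.abs(x - x.mean()))
# keep only the rows within 3 MADs of their group mean
df2 = df[dif <= 3*mad]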

However, in this case no row is filtered out: each group has at most two rows, so every row's distance from its group mean equals the group's mean absolute deviation, which is always within 3 MADs.
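On pandas 2.0 and later, where Series.mad() is no longer available, the same groupby/transform idea can be sketched by computing the mean absolute deviation by hand. This is only a minimal sketch using the sample data from the question; the helper name mean_abs_dev is hypothetical and not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    'Type': ['A', 'B', 'C', 'D', 'C', 'B', 'A', 'B'],
    'Name': ['X', 'Y', 'Z', 'X', 'X', 'Z', 'X', 'Y'],
    'Cost': [545, 789, 477, 640, 435, 335, 850, 152],
})

def mean_abs_dev(x):
    # Hypothetical helper: mean absolute deviation around the mean,
    # written out explicitly instead of relying on Series.mad().
    return (x - x.mean()).abs().mean()

cost_by_group = df.groupby(['Name', 'Type'])['Cost']

# Per-group MAD, broadcast back so it lines up with the rows of df.
mad = cost_by_group.transform(mean_abs_dev)

# Absolute distance of each row from its own group's mean.
dif = (df['Cost'] - cost_by_group.transform('mean')).abs()

# Keep rows that lie within 3 MADs of their group mean.
df2 = df[dif <= 3 * mad]
```

With this sample data every group has one or two rows, so df2 should contain the same eight rows as df; the filter only starts removing rows once groups are large enough to have real outliers.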

Julien Spronck answered Nov 18 '22 23:11
