I have a dataframe:
Type Name Cost
A X 545
B Y 789
C Z 477
D X 640
C X 435
B Z 335
A X 850
B Y 152
I have all such combinations in my dataframe with Type ['A','B','C','D'] and Names ['X','Y','Z'] . I used the groupby method to get stats on a specific combination together like A-X , A-Y , A-Z .Here's some code:
df = pd.DataFrame({'Type':['A','B','C','D','C','B','A','B'] ,'Name':['X','Y','Z','X','X','Z','X','Y'], 'Cost':[545,789,477,640,435,335,850,152]})
df.groupby(['Name','Type']).agg([mean,std])
#need to use mad instead of std
I need to eliminate the observations that are more than 3 MADs away ; something like:
test = df[np.abs(df.Cost-df.Cost.mean())<=(3*df.Cost.mad())]
I am confused with this as df.Cost.mad() returns the MAD for the Cost on the entire data rather than a specific Type-Name category. How could I combine both?
You can use groupby
and transform
to create new data series that can be used to filter out your data.
groups = df.groupby(['Name','Type'])
mad = groups['Cost'].transform(lambda x: x.mad())
dif = groups['Cost'].transform(lambda x: np.abs(x - x.mean()))
df2 = df[dif <= 3*mad]
However, in this case, no row is filtered out since the difference is equal to the mean absolute deviation (the groups have only two rows at most).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With