I want to remove outliers based on percentile 99 values by group wise.
import pandas as pd
df = pd.DataFrame({'Group': ['A','A','A','B','B','B','B'], 'count': [1.1,11.2,1.1,3.3,3.40,3.3,100.0]})
in output i want to remove 11.2 from group A and 100 from group b. so in final dataset there will only be 5 observations.
wantdf = pd.DataFrame({'Group': ['A','A','B','B','B'], 'count': [1.1,1.1,3.3,3.40,3.3]})
I have tried this one but I'm not getting the desired results
df[df.groupby("Group")['count'].transform(lambda x : (x<x.quantile(0.99))&(x>(x.quantile(0.01)))).eq(1)]
Here is my solution:
def is_outlier(s):
lower_limit = s.mean() - (s.std() * 3)
upper_limit = s.mean() + (s.std() * 3)
return ~s.between(lower_limit, upper_limit)
df = df[~df.groupby('Group')['count'].apply(is_outlier)]
You can write your own is_outlier function
I don't think you want to use quantile, as you'll exclude your lower values:
import pandas as pd
df = pd.DataFrame({'Group': ['A','A','A','B','B','B','B'], 'count': [1.1,11.2,1.1,3.3,3.40,3.3,100.0]})
print(pd.DataFrame(df.groupby('Group').quantile(.01)['count']))
output:
count
Group
A 1.1
B 3.3
Those aren't outliers, right? So you wouldn't want to exclude them.
You could try setting left and right limits by using standard deviations from the median maybe? This is a bit verbose, but it gives you the right answer:
left = pd.DataFrame(df.groupby('Group').median() - pd.DataFrame(df.groupby('Group').std()))
right = pd.DataFrame(df.groupby('Group').median() + pd.DataFrame(df.groupby('Group').std()))
left.columns = ['left']
right.columns = ['right']
df = df.merge(left, left_on='Group', right_index=True)
df = df.merge(right, left_on='Group', right_index=True)
df = df[(df['count'] > df['left']) & (df['count'] < df['right'])]
df = df.drop(['left', 'right'], axis=1)
print(df)
output:
Group count
0 A 1.1
2 A 1.1
3 B 3.3
4 B 3.4
5 B 3.3
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With