
Remove outliers in Pandas dataframe with groupby

Tags: python, pandas

I have a dataframe of Report Date, Time Interval and Total Volume for a full year. I would like to be able to remove outliers within each Time Interval.

This is as far as I've been able to get...

dft.head()

      Report Date  Time Interval  Total Volume
5784   2016-03-01             24         467.0
5785   2016-03-01             25         580.0
5786   2016-03-01             26         716.0
5787   2016-03-01             27         803.0
5788   2016-03-01             28         941.0

So I calculate the quantiles:

low = .05
high = .95
dfq = dft.groupby(['Time Interval']).quantile([low, high])
print(dfq.head())

                    Total Volume
Time Interval                   
24            0.05        420.15
              0.95        517.00
25            0.05        521.90
              0.95        653.55
26            0.05        662.75

And then I'd like to be able to use them to remove outliers within each Time Interval using something like this...

dft = dft.apply(lambda x: x[(x>dfq.loc[low,x.name]) & (x < dfq.loc[high,x.name])], axis=0)
asked Mar 07 '23 16:03 by zookman



2 Answers

One way is to filter out as follows:

In [11]: res = df.groupby("Date")["Interval"].quantile([0.05, 0.95]).unstack(level=1)

In [12]: res
Out[12]:
             0.05   0.95
Date
2016-03-01  489.6  913.4

Now we can look up these bounds for each row using loc and build a boolean filter:

In [13]: (res.loc[df.Date, 0.05] < df.Interval.values) & (df.Interval.values < res.loc[df.Date, 0.95])
Out[13]:
Date
2016-03-01    False
2016-03-01     True
2016-03-01     True
2016-03-01     True
2016-03-01    False
dtype: bool

In [14]: df.loc[((res.loc[df.Date, 0.05] < df.Interval.values) & (df.Interval.values < res.loc[df.Date, 0.95])).values]
Out[14]:
   Report        Date  Time  Interval  Total Volume
1    5785  2016-03-01    25     580.0           NaN
2    5786  2016-03-01    26     716.0           NaN
3    5787  2016-03-01    27     803.0           NaN

Note: grouping by 'Time Interval' will work the same, but in your example doesn't filter any rows!
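
To illustrate the note above, here is a minimal sketch of the same lookup-and-filter idea grouped by 'Time Interval' instead, with several dates per interval so that the quantiles actually differ. The data is made up for illustration, and using `map` for the lookup (rather than `loc`) is my own substitution, not something taken from the question.

```python
import pandas as pd

# Hypothetical sample data: three intervals, 20 daily readings each.
dates = pd.date_range("2016-03-01", periods=20, freq="D")
rows = []
for interval, base in [(24, 450.0), (25, 580.0), (26, 700.0)]:
    for i, d in enumerate(dates):
        rows.append({"Report Date": d, "Time Interval": interval, "Total Volume": base + i})
dft = pd.DataFrame(rows)
dft.loc[0, "Total Volume"] = 5000.0  # planted outlier in interval 24

low, high = 0.05, 0.95

# One row per interval, one column per quantile (columns are 0.05 and 0.95).
bounds = dft.groupby("Time Interval")["Total Volume"].quantile([low, high]).unstack(level=1)

# Look up each row's bounds by its interval, then keep rows strictly inside them.
lower = dft["Time Interval"].map(bounds[low])
upper = dft["Time Interval"].map(bounds[high])
mask = (dft["Total Volume"] > lower) & (dft["Total Volume"] < upper)
filtered = dft[mask]
```

Because the comparison is strict, the smallest and largest readings of each interval are dropped along with the planted spike.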

answered Mar 19 '23 04:03 by Andy Hayden



df[df.groupby("ReportDate").TotalVolume.\
      transform(lambda x : (x<x.quantile(0.95))&(x>(x.quantile(0.05)))).eq(1)]
Out[1033]: 
      ReportDate  TimeInterval  TotalVolume
5785  2016-03-01            25        580.0
5786  2016-03-01            26        716.0
5787  2016-03-01            27        803.0
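
The same transform pattern can be pointed at the interval column, which is what the question asks for. A rough sketch, assuming the column names shown above and made-up data; `drop_group_outliers` is a hypothetical helper, not a pandas function.

```python
import pandas as pd

def drop_group_outliers(df, group_col, value_col, low=0.05, high=0.95):
    """Keep rows whose value lies strictly between its group's low/high quantiles.

    Hypothetical helper written for this example; not part of pandas.
    """
    grouped = df.groupby(group_col)[value_col]
    # transform broadcasts each group's quantile back onto the original rows.
    lower = grouped.transform(lambda s: s.quantile(low))
    upper = grouped.transform(lambda s: s.quantile(high))
    return df[(df[value_col] > lower) & (df[value_col] < upper)]

# Made-up data: two intervals, eleven readings each, one spike in interval 24.
df = pd.DataFrame({
    "TimeInterval": [24] * 11 + [25] * 11,
    "TotalVolume": [float(v) for v in range(460, 470)] + [9999.0]
                   + [float(v) for v in range(580, 591)],
})
cleaned = drop_group_outliers(df, "TimeInterval", "TotalVolume")
```

Computing the two bounds as separate aligned Series keeps the final mask a plain element-wise comparison, so the original index is preserved in `cleaned`.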
answered Mar 19 '23 04:03 by BENY
