Remove outliers in Pandas dataframe with groupby

Question

I have a dataframe of Report Date, Time Interval and Total Volume for a full year. I would like to be able to remove outliers within each Time Interval.

This is as far as I've been able to get...

dft.head()

      Report Date  Time Interval  Total Volume
5784   2016-03-01             24         467.0
5785   2016-03-01             25         580.0
5786   2016-03-01             26         716.0
5787   2016-03-01             27         803.0
5788   2016-03-01             28         941.0

So I calculate the quantiles:

low = .05
high = .95
dfq = dft.groupby(['Time Interval']).quantile([low, high])
print(dfq.head())

                    Total Volume
Time Interval                   
24            0.05        420.15
              0.95        517.00
25            0.05        521.90
              0.95        653.55
26            0.05        662.75

And then I'd like to be able to use them to remove outliers within each Time Interval using something like this...

dft = dft.apply(lambda x: x[(x>dfq.loc[low,x.name]) & (x < dfq.loc[high,x.name])], axis=0)

Andy Hayden · Accepted Answer

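One way is to compute the quantiles per group and then filter on them. Note that in this example df was read from the pasted sample with the headers split on whitespace, so Date holds the report date and Interval holds the volume values (see Out[14]):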

In [11]: res = df.groupby("Date")["Interval"].quantile([0.05, 0.95]).unstack(level=1)

In [12]: res
Out[12]:
             0.05   0.95
Date
2016-03-01  489.6  913.4

Now we can look up these values for each row using loc and filter:

In [13]: (res.loc[df.Date, 0.05] < df.Interval.values) & (df.Interval.values < res.loc[df.Date, 0.95])
Out[13]:
Date
2016-03-01    False
2016-03-01     True
2016-03-01     True
2016-03-01     True
2016-03-01    False
dtype: bool

In [14]: df.loc[((res.loc[df.Date, 0.05] < df.Interval.values) & (df.Interval.values < res.loc[df.Date, 0.95])).values]
Out[14]:
   Report        Date  Time  Interval  Total Volume
1    5785  2016-03-01    25     580.0           NaN
2    5786  2016-03-01    26     716.0           NaN
3    5787  2016-03-01    27     803.0           NaN

Note: grouping by 'Time Interval' works the same way, but each interval appears only once in your example, so both quantiles equal that single value and the filter is not meaningful there.
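
For reference, here is a rough sketch of the same lookup idea using the question's original column names and grouping by 'Time Interval'. The sample values are made up for illustration (the question only shows one day of data), and the names bounds, lower, upper and filtered are my own. It uses Series.map instead of loc to fetch each row's bounds, which should be equivalent here.

```python
import pandas as pd

# Hypothetical data: two intervals observed on five days each.
dft = pd.DataFrame({
    "Report Date": list(pd.date_range("2016-03-01", periods=5)) * 2,
    "Time Interval": [24] * 5 + [25] * 5,
    "Total Volume": [467.0, 470.0, 475.0, 480.0, 900.0,
                     580.0, 585.0, 590.0, 595.0, 100.0],
})

low, high = 0.05, 0.95

# One row per interval, one column per quantile (columns are 0.05 and 0.95).
bounds = (
    dft.groupby("Time Interval")["Total Volume"]
       .quantile([low, high])
       .unstack(level=1)
)

# Look up the lower/upper bound for every row via its interval.
lower = dft["Time Interval"].map(bounds[low])
upper = dft["Time Interval"].map(bounds[high])

# Keep rows strictly inside their own interval's quantile range.
filtered = dft[(dft["Total Volume"] > lower) & (dft["Total Volume"] < upper)]
print(filtered)
```

With only a handful of observations per interval, the strict comparisons always drop the smallest and largest value in each group, so this is mostly useful once each interval has a full year of rows.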

BENY · Answer

df[df.groupby("ReportDate").TotalVolume.\
      transform(lambda x : (x<x.quantile(0.95))&(x>(x.quantile(0.05)))).eq(1)]
Out[1033]: 
      ReportDate  TimeInterval  TotalVolume
5785  2016-03-01            25        580.0
5786  2016-03-01            26        716.0
5787  2016-03-01            27        803.0
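
The transform approach above can also be wrapped in a small helper so the grouping column is a parameter. This is only a sketch: drop_group_outliers is a hypothetical name, not a pandas function, and the sample frame is invented to have several days per interval. It computes the per-group bounds with two transform calls and keeps the original index.

```python
import pandas as pd

def drop_group_outliers(df, group_col, value_col, low=0.05, high=0.95):
    """Keep rows whose value lies strictly between the group's quantiles."""
    grouped = df.groupby(group_col)[value_col]
    # transform broadcasts each group's scalar quantile back to its rows.
    lower = grouped.transform(lambda s: s.quantile(low))
    upper = grouped.transform(lambda s: s.quantile(high))
    return df[(df[value_col] > lower) & (df[value_col] < upper)]

# Hypothetical data: two intervals observed on five days each.
sample = pd.DataFrame({
    "ReportDate": list(pd.date_range("2016-03-01", periods=5)) * 2,
    "TimeInterval": [24] * 5 + [25] * 5,
    "TotalVolume": [420.0, 450.0, 467.0, 500.0, 2000.0,
                    5.0, 560.0, 580.0, 600.0, 650.0],
})

cleaned = drop_group_outliers(sample, "TimeInterval", "TotalVolume")
print(cleaned)
```

Passing "ReportDate" as group_col instead should give the per-date behaviour shown in the answer above.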
