I have a dataframe of Report Date, Time Interval and Total Volume for a full year. I would like to be able to remove outliers within each Time Interval.
This is as far as I've been able to get...
dft.head()
      Report Date  Time Interval  Total Volume
5784   2016-03-01             24         467.0
5785   2016-03-01             25         580.0
5786   2016-03-01             26         716.0
5787   2016-03-01             27         803.0
5788   2016-03-01             28         941.0
So I calculate the quantiles:
low = .05
high = .95
dfq = dft.groupby(['Time Interval']).quantile([low, high])
print(dfq.head())
                    Total Volume
Time Interval
24            0.05        420.15
              0.95        517.00
25            0.05        521.90
              0.95        653.55
26            0.05        662.75
And then I'd like to be able to use them to remove outliers within each Time Interval using something like this...
dft = dft.apply(lambda x: x[(x>dfq.loc[low,x.name]) & (x < dfq.loc[high,x.name])], axis=0)
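One way is to compute the quantiles per group and filter on them (in this session the sample was loaded with columns named Date and Interval, which appear to hold the report date and the total volume):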
In [11]: res = df.groupby("Date")["Interval"].quantile([0.05, 0.95]).unstack(level=1)
In [12]: res
Out[12]:
             0.05   0.95
Date
2016-03-01  489.6  913.4
Now we can look up these values for each row using loc and filter:
In [13]: (res.loc[df.Date, 0.05] < df.Interval.values) & (df.Interval.values < res.loc[df.Date, 0.95])
Out[13]:
Date
2016-03-01 False
2016-03-01 True
2016-03-01 True
2016-03-01 True
2016-03-01 False
dtype: bool
In [14]: df.loc[((res.loc[df.Date, 0.05] < df.Interval.values) & (df.Interval.values < res.loc[df.Date, 0.95])).values]
Out[14]:
Report Date Time Interval Total Volume
1 5785 2016-03-01 25 580.0 NaN
2 5786 2016-03-01 26 716.0 NaN
3 5787 2016-03-01 27 803.0 NaN
Note: grouping by 'Time Interval' works the same way, but in your example each interval has only one row, so both quantiles equal that row's value and the strict comparison keeps no rows!
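For reference, here is a minimal, self-contained sketch of the same lookup idea applied per Time Interval, using the column names from the question. The sample values are made up for illustration (they are not the asker's data), and `Series.map` is used here instead of `loc` for the lookup so the bounds keep the original row index; treat it as one possible arrangement, not the only one.

```python
import pandas as pd

# Hypothetical sample: two intervals, each with one low and one high outlier.
df = pd.DataFrame({
    "Report Date": pd.to_datetime(
        ["2016-03-01", "2016-03-02", "2016-03-03",
         "2016-03-04", "2016-03-05", "2016-03-06"] * 2),
    "Time Interval": [24] * 6 + [25] * 6,
    "Total Volume": [100.0, 460.0, 465.0, 470.0, 475.0, 900.0,
                     150.0, 570.0, 575.0, 580.0, 585.0, 990.0],
})

low, high = 0.05, 0.95

# One row per interval, one column per quantile.
bounds = (df.groupby("Time Interval")["Total Volume"]
            .quantile([low, high])
            .unstack(level=1))

# Look up each row's bounds by its interval; map keeps the original index.
lower = df["Time Interval"].map(bounds[low])
upper = df["Time Interval"].map(bounds[high])

# Keep rows strictly inside their own interval's quantile range.
filtered = df[(df["Total Volume"] > lower) & (df["Total Volume"] < upper)]
print(filtered)
```

Because the comparison is strict, the smallest and largest value in each interval are dropped in this sample.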
Another option is to build the boolean mask with transform:
df[df.groupby("ReportDate").TotalVolume
     .transform(lambda x: (x > x.quantile(0.05)) & (x < x.quantile(0.95))).eq(1)]
Out[1033]:
ReportDate TimeInterval TotalVolume
5785 2016-03-01 25 580.0
5786 2016-03-01 26 716.0
5787 2016-03-01 27 803.0
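The transform approach can also be wrapped in a small helper so the grouping column is a parameter. This is only a sketch: `drop_outliers` is a hypothetical name, not a pandas function, and the frame below rebuilds the five sample rows from the question with its original column names.

```python
import pandas as pd

def drop_outliers(df, group_col, value_col, low=0.05, high=0.95):
    """Keep rows whose value lies strictly between its group's quantiles."""
    grouped = df.groupby(group_col)[value_col]
    # transform broadcasts each group's quantile back onto that group's rows.
    lower = grouped.transform(lambda s: s.quantile(low))
    upper = grouped.transform(lambda s: s.quantile(high))
    return df[(df[value_col] > lower) & (df[value_col] < upper)]

# The five sample rows from the question, all on one report date.
sample = pd.DataFrame({
    "Report Date": ["2016-03-01"] * 5,
    "Time Interval": [24, 25, 26, 27, 28],
    "Total Volume": [467.0, 580.0, 716.0, 803.0, 941.0],
}, index=[5784, 5785, 5786, 5787, 5788])

by_date = drop_outliers(sample, "Report Date", "Total Volume")
print(by_date)
```

On the full year of data, calling `drop_outliers(dft, "Time Interval", "Total Volume")` would apply the same filter within each interval.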