Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Faster way to remove outliers by group in large pandas DataFrame [duplicate]

Tags:

python

pandas

I have a relatively large DataFrame object (about a million rows, hundreds of columns), and I'd like to clip outliers in each column by group. By "clip outliers for each column by group" I mean - compute the 5% and 95% quantiles for each column in a group and clip values outside this quantile range.

Here's the setup I'm currently using:

def winsorize_series(s):
    q = s.quantile([0.05, 0.95])
    if isinstance(q, pd.Series) and len(q) == 2:
        s[s < q.iloc[0]] = q.iloc[0]
        s[s > q.iloc[1]] = q.iloc[1]
    return s

def winsorize_df(df):
    return df.apply(winsorize_series, axis=0)

and then, with my DataFrame called features and indexed by DATE, I can do

grouped = features.groupby(level='DATE')
result = grouped.apply(winsorize_df)

This works, except that it's very slow, presumably due to the nested apply calls: one on each group, and then one for each column in each group. I tried getting rid of the second apply by computing quantiles for all columns at once, but got stuck trying to threshold each column by a different value. Is there a faster way to accomplish this procedure?

like image 254
ytsaig Avatar asked Dec 11 '14 13:12

ytsaig


2 Answers

There is a winsorize function in scipy.stats.mstats which you might consider using. Note however, that it returns slightly different values than winsorize_series:

In [126]: winsorize_series(pd.Series(range(20), dtype='float'))[0]
Out[126]: 0.95000000000000007

In [127]: mstats.winsorize(pd.Series(range(20), dtype='float'), limits=[0.05, 0.05])[0]
Out[127]: 1.0

Using mstats.winsorize instead of winsorize_series is maybe (depending on N, M, P) ~1.5x faster:

import numpy as np
import pandas as pd
from scipy.stats import mstats

def using_mstats_df(df):
    return df.apply(using_mstats, axis=0)

def using_mstats(s):
    return mstats.winsorize(s, limits=[0.05, 0.05])

N, M, P = 10**5, 10, 10**2
dates = pd.date_range('2001-01-01', periods=N//P, freq='D').repeat(P)
df = pd.DataFrame(np.random.random((N, M))
                  , index=dates)
df.index.names = ['DATE']
grouped = df.groupby(level='DATE')

In [122]: %timeit result = grouped.apply(winsorize_df)
1 loops, best of 3: 17.8 s per loop

In [123]: %timeit mstats_result = grouped.apply(using_mstats_df)
1 loops, best of 3: 11.2 s per loop
like image 91
unutbu Avatar answered Oct 12 '22 15:10

unutbu


Here is a solution without using scipy.stats.mstats:

def clip_series(s, lower, upper):
   clipped = s.clip(lower=s.quantile(lower), upper=s.quantile(upper), axis=1)
   return clipped

# Manage list of features to be winsorized
feature_list = list(features.columns)

for f in feature_list:
   features[f] = clip_series(features[f], 0.05, 0.95)
like image 41
tnf Avatar answered Oct 12 '22 14:10

tnf