Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Compute rolling z-score in pandas dataframe

Tags:

python

pandas

Is there a open source function to compute moving z-score like https://turi.com/products/create/docs/generated/graphlab.toolkits.anomaly_detection.moving_zscore.create.html. I have access to pandas rolling_std for computing std, but want to see if it can be extended to compute rolling z scores.

like image 261
user308827 Avatar asked Nov 07 '17 18:11

user308827


People also ask

What is rolling average in pandas?

A moving average, also called a rolling or running average, is used to analyze the time-series data by calculating averages of different subsets of the complete dataset. Since it involves taking the average of the dataset over time, it is also called a moving mean (MM) or rolling mean.

What is Zscore Python?

zscore(arr, axis=0, ddof=0) function computes the relative Z-score of the input data, relative to the sample mean and standard deviation. Its formula: Parameters : arr : [array_like] Input array or object for which Z-score is to be calculated.


1 Answers

rolling.apply with a custom function is significantly slower than using builtin rolling functions (such as mean and std). Therefore, compute the rolling z-score from the rolling mean and rolling std:

def zscore(x, window):
    r = x.rolling(window=window)
    m = r.mean().shift(1)
    s = r.std(ddof=0).shift(1)
    z = (x-m)/s
    return z

According to the definition given on this page the rolling z-score depends on the rolling mean and std just prior to the current point. The shift(1) is used above to achieve this effect.


Below, even for a small Series (of length 100), zscore is over 5x faster than using rolling.apply. Since rolling.apply(zscore_func) calls zscore_func once for each rolling window in essentially a Python loop, the advantage of using the Cythonized r.mean() and r.std() functions becomes even more apparent as the size of the loop increases. Thus, as the length of the Series increases, the speed advantage of zscore increases.

In [58]: %timeit zscore(x, N)
1000 loops, best of 3: 903 µs per loop

In [59]: %timeit zscore_using_apply(x, N)
100 loops, best of 3: 4.84 ms per loop

This is the setup used for the benchmark:

import numpy as np
import pandas as pd
np.random.seed(2017)

def zscore(x, window):
    r = x.rolling(window=window)
    m = r.mean().shift(1)
    s = r.std(ddof=0).shift(1)
    z = (x-m)/s
    return z


def zscore_using_apply(x, window):
    def zscore_func(x):
        return (x[-1] - x[:-1].mean())/x[:-1].std(ddof=0)
    return x.rolling(window=window+1).apply(zscore_func)

N = 5
x = pd.Series((np.random.random(100) - 0.5).cumsum())

result = zscore(x, N)
alt = zscore_using_apply(x, N)

assert not ((result - alt).abs() > 1e-8).any()
like image 165
unutbu Avatar answered Sep 20 '22 14:09

unutbu