Suppose I have a DataFrame on which I want to calculate rolling or expanding Pearson correlations between two columns:
import numpy as np
import pandas as pd
import scipy.stats as st
df = pd.DataFrame({'x': np.random.rand(10000), 'y': np.random.rand(10000)})
With the inbuilt pandas functionality it is very fast to calculate these:
expanding_corr = df['x'].expanding(50).corr(df['y'])
rolling_corr = df['x'].rolling(50).corr(df['y'])
However, if I wish to get the p-values associated with these correlations, the best I could do was to define a custom rolling function and pass apply to a groupby object:
def custom_roll(df, w, **kwargs):
    v = df.values
    d0, d1 = v.shape
    s0, s1 = v.strides
    # 3D strided view of the frame: (number of windows, window length, number of columns)
    a = np.lib.stride_tricks.as_strided(v, (d0 - (w - 1), w, d1), (s0, s0, s1))
    # Stack the windows into a frame keyed by each window's end label
    rolled_df = pd.concat({
        row: pd.DataFrame(values, columns=df.columns)
        for row, values in zip(df.index[(w-1):], a)
    })
    return rolled_df.groupby(level=0, **kwargs)
c_df = custom_roll(df, 50).apply(lambda df: st.pearsonr(df['x'], df['y']))
c_df now contains the appropriate correlations and, importantly, their associated p-values.
However, this method is extremely slow compared to the inbuilt pandas method, which makes it unsuitable: in practice I am calculating these correlations thousands of times during an optimisation process. Furthermore, I am unsure how to extend the custom_roll function to work for expanding windows.
Can anyone point me in the direction of leveraging numpy to get the p-values over expanding windows at vectorised speed?
I could not think of a clever way to do this in pandas using rolling directly, but note that you can calculate the p-value given the correlation coefficient.
Pearson's correlation coefficient follows Student's t-distribution, and you can get the p-value by plugging it into the CDF defined by the incomplete beta function, scipy.special.betainc. It sounds complicated but can be done in a few lines of code. Below is a function that computes the p-value given the correlation coefficient corr and the sample size n. It is actually based on scipy's implementation, which you have been using.
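Concretely, with ν = n − 2 degrees of freedom the test statistic is t = r·sqrt(ν / (1 − r²)), and the two-sided p-value is p = I_{ν/(ν + t²)}(ν/2, 1/2), where I denotes the regularized incomplete beta function that scipy.special.betainc implements. The function below is just this formula written out.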
import pandas as pd
from scipy.special import betainc

def pvalue(corr, n=50):
    # Degrees of freedom of the t-statistic for a Pearson correlation
    df = n - 2
    t_squared = corr**2 * (df / ((1.0 - corr) * (1.0 + corr)))
    # Two-sided p-value via the regularized incomplete beta function
    prob = betainc(0.5*df, 0.5, df/(df+t_squared))
    return prob
You can then apply this function to the correlation values you already have.
rolling_corr = df['x'].rolling(50).corr(df['y'])
pvalue(rolling_corr)
It might not be the perfect vectorised numpy solution but should be tens of times faster than calculating the correlations over and over again.
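Since the question also asks about expanding windows: with an expanding window the sample size grows with each row, so you can pass a per-row n instead of a fixed window length. A minimal sketch, assuming x and y contain no NaNs (so row i has seen i + 1 observations):
expanding_corr = df['x'].expanding(50).corr(df['y'])
n_obs = pd.Series(np.arange(1, len(df) + 1), index=df.index)  # per-row sample size; assumes no NaNs
expanding_pvals = pvalue(expanding_corr, n=n_obs)
Everything inside pvalue is element-wise, so passing a Series for n simply broadcasts against the correlation values.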
Approach #1
corr2_coeff_rowwise lists how to do element-wise correlation between rows. We could strip it down to the case of element-wise correlation between two columns, so we would end up with a loop that uses corr2_coeff_rowwise. Then we would try to vectorize it and see that there are pieces in it that could be vectorized:
1. Getting the window means: this could be vectorized with the use of a uniform filter.
2. Getting the differences between those means and the sliding window elements: this could be vectorized with broadcasting.
The rest stays the same to get the first of the two outputs from pearsonr.
To get the second output, we go back to the source code. This should be straightforward given the first coefficient output.
So, with those in mind, we would end up with something like this -
import scipy.special as special
from scipy.ndimage import uniform_filter

def sliding_corr1(a, b, W):
    # a, b are input arrays; W is the window length
    am = uniform_filter(a.astype(float), W)
    bm = uniform_filter(b.astype(float), W)
    # Centre values of the uniform filter are the per-window means
    amc = am[W//2:-W//2+1]
    bmc = bm[W//2:-W//2+1]

    # Differences of every element against every window mean (broadcasting)
    da = a[:, None] - amc
    db = b[:, None] - bmc

    # Get sliding mask of valid windows: row i belongs to window j iff j <= i < j + W
    m, n = da.shape
    mask1 = np.arange(m)[:, None] >= np.arange(n)
    mask2 = np.arange(m)[:, None] < np.arange(n) + W
    mask = mask1 & mask2
    dam = da * mask
    dbm = db * mask

    ssAs = np.einsum('ij,ij->j', dam, dam)
    ssBs = np.einsum('ij,ij->j', dbm, dbm)
    D = np.einsum('ij,ij->j', dam, dbm)
    coeff = D / np.sqrt(ssAs * ssBs)

    # Two-sided p-value, mirroring scipy.stats.pearsonr
    ab = W/2 - 1
    pval = 2*special.btdtr(ab, ab, 0.5*(1 - np.abs(coeff)))
    return coeff, pval
Thus, to get the final output from the pandas Series inputs -
out = sliding_corr1(df['x'].to_numpy(copy=False),df['y'].to_numpy(copy=False),50)
Approach #2
Very similar to Approach #1, but we will use numba to improve the memory efficiency, replacing step #2 of the previous approach.
from numba import njit, prange
import math

@njit(parallel=True)
def sliding_corr2_coeff(a, b, amc, bmc, W):
    L = len(a) - W + 1
    out00 = np.empty(L)
    for i in prange(L):
        out_a = 0
        out_b = 0
        out_D = 0
        for j in range(W):
            d_a = a[i+j] - amc[i]
            d_b = b[i+j] - bmc[i]
            out_D += d_a*d_b
            out_a += d_a**2
            out_b += d_b**2
        out00[i] = out_D/math.sqrt(out_a*out_b)
    return out00

def sliding_corr2(a, b, W):
    # Window means via uniform filter, as in Approach #1
    am = uniform_filter(a.astype(float), W)
    bm = uniform_filter(b.astype(float), W)
    amc = am[W//2:-W//2+1]
    bmc = bm[W//2:-W//2+1]

    coeff = sliding_corr2_coeff(a, b, amc, bmc, W)

    ab = W/2 - 1
    pval = 2*special.btdtr(ab, ab, 0.5*(1 - np.abs(coeff)))
    return coeff, pval
Approach #3
Very similar to the previous one, except that we are pushing all of the coefficient work to numba -
@njit(parallel=True)
def sliding_corr3_coeff(a, b, W):
    L = len(a) - W + 1
    out00 = np.empty(L)
    for i in prange(L):
        # Window means computed inside the loop, avoiding the uniform filter
        a_mean = 0.0
        b_mean = 0.0
        for j in range(W):
            a_mean += a[i+j]
            b_mean += b[i+j]
        a_mean /= W
        b_mean /= W

        out_a = 0
        out_b = 0
        out_D = 0
        for j in range(W):
            d_a = a[i+j] - a_mean
            d_b = b[i+j] - b_mean
            out_D += d_a*d_b
            out_a += d_a*d_a
            out_b += d_b*d_b
        out00[i] = out_D/math.sqrt(out_a*out_b)
    return out00

def sliding_corr3(a, b, W):
    coeff = sliding_corr3_coeff(a, b, W)
    ab = W/2 - 1
    pval = 2*special.btdtr(ab, ab, 0.5*(1 - np.abs(coeff)))
    return coeff, pval
Timings -
In [181]: df = pd.DataFrame({'x': np.random.rand(10000), 'y': np.random.rand(10000)})
In [182]: %timeit sliding_corr2(df['x'].to_numpy(copy=False),df['y'].to_numpy(copy=False),50)
100 loops, best of 3: 5.05 ms per loop
In [183]: %timeit sliding_corr3(df['x'].to_numpy(copy=False),df['y'].to_numpy(copy=False),50)
100 loops, best of 3: 5.51 ms per loop
Note:
sliding_corr1 seems to take long on this dataset, most probably because of the memory requirement of its step #2.
After switching to the numba functions, the bottleneck moves to the p-value computation with special.btdtr.
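As a quick spot check of the fast implementations against scipy, you could compare the first window with st.pearsonr directly (a small sketch reusing the df, sliding_corr3 and st names defined above):
x = df['x'].to_numpy(copy=False)
y = df['y'].to_numpy(copy=False)
coeff, pval = sliding_corr3(x, y, 50)
r_ref, p_ref = st.pearsonr(x[:50], y[:50])  # first window, computed the slow way
print(np.isclose(coeff[0], r_ref), np.isclose(pval[0], p_ref))  # both should print True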