How to tackle inconsistent results while using pandas rolling correlation?

Let me preface this by saying that reproducing the problem requires a large dataset, and that is part of the problem: I can't predict when the peculiarity is going to pop up. The data is too large (~13k rows, 2 cols) to paste into the question, so I have added a pastebin link at the end of the post.


I have been facing a peculiar problem for the past few days with pandas.core.window.rolling.Rolling.corr. I have a dataset where I am trying to calculate rolling correlations. This is the problem:

While calculating rolling (window_size=100) correlations between two columns (a and b), some indices (one such index is 12981) give values near 0 (of order 1e-10), where the result should ideally be nan or inf (because all values in one column are constant). However, if I calculate a standalone correlation for that index (i.e. over the last 100 rows of data, including the said index), or perform the rolling calculation on a smaller number of rows (e.g. 300 or 1000 as opposed to 13k), I get the correct result (i.e. nan or inf).

Expectation:

>>> df = pd.read_csv('sample_corr_data.csv') # link at the end,  ## columns = ['a', 'b']
>>> df.a.tail(100).value_counts()

 0.000000    86
-0.000029     3
 0.000029     3
-0.000029     2
 0.000029     2
-0.000029     2
 0.000029     2
Name: a, dtype: int64        # repeated-looking values differ beyond the displayed precision

>>> df.b.tail(100).value_counts()     # all 100 values are the same
 
6.0    100
Name: b, dtype: int64

>>> df.a.tail(100).corr(df.b.tail(100))
nan                                      # expected, because column 'b' has the same value throughout

# Made sure of this using,
# 1. np.corrcoef, because pandas uses this internally to calculate pearson moments
>>> np.corrcoef(df.a.tail(100), df.b.tail(100))[0, 1]
nan

# 2. using custom function
>>> def pearson(a, b):
...     n = a.size
...     num = n*np.nansum(a*b) - np.nansum(a)*np.nansum(b)
...     den = (n*np.nansum(a**2) - np.nansum(a)**2)*(n*np.nansum(b**2) - np.nansum(b)**2)
...     return num/np.sqrt(den) if den * np.isfinite(den*num) else np.nan

>>> pearson(df.a.tail(100), df.b.tail(100))
nan

Now, the reality:

>>> df.a.rolling(100).corr(df.b).tail(3)
 
12979    7.761921e-07
12980    5.460717e-07
12981    2.755881e-10                    # This should have been NaN/inf !!

## Furthermore!!

>>> debug = df.tail(300)
>>> debug.a.rolling(100).corr(debug.b).tail(3)

12979    7.761921e-07
12980    5.460717e-07
12981            -inf                    # Got -inf, fine
dtype: float64

>>> debug = df.tail(3000)
>>> debug.a.rolling(100).corr(debug.b).tail(3)
 
12979    7.761921e-07
12980    5.460717e-07
12981             inf                     # Got +inf, still acceptable
dtype: float64

This continues up to 9369 rows:

>>> debug = df.tail(9369)
>>> debug.a.rolling(100).corr(debug.b).tail(3)

12979    7.761921e-07
12980    5.460717e-07
12981             inf
dtype: float64

# then
>>> debug = df.tail(9370)
>>> debug.a.rolling(100).corr(debug.b).tail(3)

12979    7.761921e-07
12980    5.460717e-07
12981    4.719615e-10                    # SPOOKY ACTION AT A DISTANCE!!!
dtype: float64

>>> debug = df.tail(10000)
>>> debug.a.rolling(100).corr(debug.b).tail(3)
 
12979    7.761921e-07
12980    5.460717e-07
12981    1.198994e-10                    # SPOOKY ACTION AT A DISTANCE!!!
dtype: float64

Current Workaround

>>> df.a.rolling(100).apply(lambda x: x.corr(df.b.reindex(x.index))).tail(3)   # PREDICTABLY, VERY SLOW!

12979    7.761921e-07
12980    5.460717e-07
12981             NaN
Name: a, dtype: float64

# again this checks out using other methods,
>>> df.a.rolling(100).apply(lambda x: np.corrcoef(x, df.b.reindex(x.index))[0, 1]).tail(3)
 
12979    7.761921e-07
12980    5.460717e-07
12981             NaN
Name: a, dtype: float64

>>> df.a.rolling(100).apply(lambda x: pearson(x, df.b.reindex(x.index))).tail(3)

12979    7.761921e-07
12980    5.460717e-07
12981             NaN
Name: a, dtype: float64

As far as I understand, the result of series.rolling(n).corr(other_series) should match the following:

>>> def rolling_corr(series, other_series, n=100):
...     return pd.Series(
...         [np.nan]*(n-1) + [series[i-n:i].corr(other_series[i-n:i])
...                           for i in range(n, series.size+1)],
...         index=series.index
...     )

>>> rolling_corr(df.a, df.b).tail(3)

12979    7.761921e-07
12980    5.460717e-07
12981             NaN

At first I thought this was a floating-point arithmetic issue (because initially, in some cases, I could fix it by rounding column 'a' to 5 decimal places, or by casting to float32), but in that case it would be present irrespective of the number of samples used. So there must be some issue with rolling itself, or at least rolling gives rise to floating-point issues depending on the size of the data. I checked the source code of rolling.corr, but could not find anything that would explain such inconsistencies. And now I am worried about how much of my past code is plagued by this issue.

What is the reason behind this, and how do I fix it? If this is happening because pandas perhaps prefers speed over accuracy (as suggested here), does that mean I can never reliably use pandas.rolling operations on a large sample? How do I know the size beyond which this inconsistency will appear?


sample_corr_data.csv: https://pastebin.com/jXXHSv3r

Tested in

  • Windows 10, python 3.9.1, pandas 1.2.2, (IPython 7.20)
  • Windows 10, python 3.8.2, pandas 1.0.5, (IPython 7.19)
  • Ubuntu 20.04, python 3.7.7, pandas 1.0.5, (GCC 7.3.0, standard REPL)
  • CentOS Linux 7 (Core), Python 2.7.5, pandas 0.23.4, (IPython 5.8.0)

Note: Different OS return different values at the said index, but all are finite and near 0.

Asked Mar 13 '21 by Sayandip Dutta

1 Answer

What if you replace the sums in your pearson formula with rolling sums?


def rolling_pearson(a, b, n):
    a_sum = a.rolling(n).sum()
    b_sum = b.rolling(n).sum()
    ab_sum = (a*b).rolling(n).sum()
    aa_sum = (a**2).rolling(n).sum()
    bb_sum = (b**2).rolling(n).sum()

    num = n * ab_sum - a_sum * b_sum
    den = (n * aa_sum - a_sum**2) * (n * bb_sum - b_sum**2)
    return num / den**0.5

rolling_pearson(df.a, df.b, 100)

             ...     
12977    1.109077e-06
12978    9.555249e-07
12979    7.761921e-07
12980    5.460717e-07
12981             inf
Length: 12982, dtype: float64
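If the built-in rolling.corr is still preferred, another option (my suggestion, not part of the original answer) is to mask out windows where either series is numerically constant, so the spurious finite values become NaN. The relative threshold `tol` here is an arbitrary choice and has to be tuned to the data:

```python
import numpy as np
import pandas as pd

def rolling_corr_safe(a, b, n, tol=1e-5):
    """Rolling correlation that masks windows where either series is
    numerically constant (rolling std negligible relative to the level),
    returning NaN instead of inf or a spurious near-zero value.
    `tol` is an arbitrary relative threshold; tune it to your data."""
    corr = a.rolling(n).corr(b)
    a_flat = a.rolling(n).std() <= tol * a.rolling(n).mean().abs()
    b_flat = b.rolling(n).std() <= tol * b.rolling(n).mean().abs()
    return corr.mask(a_flat | b_flat)
```

For the question's data, the constant windows of b have a spurious rolling variance of order 1e-12, i.e. a rolling std of about 2.5e-6 against a mean of 6, so a relative threshold of 1e-5 would catch them; for other data the right threshold may differ.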

Why is this so?

To answer this question I needed to check the implementation, because indeed the variance of the last 100 samples of b is zero, and the rolling correlation is computed as a.cov(b) / (a.var() * b.var())**0.5.
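That identity is easy to verify against the built-in result (a quick sketch on random data, where no window has zero variance; the seed is arbitrary):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = pd.Series(rng.normal(size=500))
b = pd.Series(rng.normal(size=500))
n = 100

# corr = cov / sqrt(var_a * var_b), window by window
manual = a.rolling(n).cov(b) / (a.rolling(n).var() * b.rolling(n).var()) ** 0.5
builtin = a.rolling(n).corr(b)
assert np.allclose(manual.dropna(), builtin.dropna())
```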

After some searching I found the rolling variance implementation here; the method used is Welford's online algorithm. This algorithm is nice because you can add one sample (and remove another) in constant time per update, comparable in cost to the cumulative-sum methods. Here it is rewritten in Python:

def welford_add(existingAggregate, newValue):
    if pd.isna(newValue):
        return existingAggregate
    (count, mean, M2) = existingAggregate
    count += 1
    delta = newValue - mean
    mean += delta / count
    delta2 = newValue - mean
    M2 += delta * delta2
    return (count, mean, M2)

def welford_remove(existingAggregate, newValue):
    if pd.isna(newValue):
        return existingAggregate
    (count, mean, M2) = existingAggregate
    count -= 1
    delta = newValue - mean
    mean -= delta / count
    delta2 = newValue - mean
    M2 -= delta * delta2
    return (count, mean, M2)

def finalize(existingAggregate):
    (count, mean, M2) = existingAggregate
    variance = M2 / count if count > 0 else None
    sampleVariance = M2 / (count - 1) if count > 1 else None
    return (mean, variance, sampleVariance)

In the pandas implementation they mention Kahan summation, which is important for getting better precision in additions, but the results here are not improved by it (I didn't check whether it is properly implemented or not).
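For reference, Kahan (compensated) summation carries a running correction term so that low-order bits are not lost when a small addend meets a large total; a minimal sketch, not the pandas implementation:

```python
def kahan_sum(values):
    """Compensated (Kahan) summation: carries a running error term `c`
    so that small addends are not lost when added to a large total."""
    total = 0.0
    c = 0.0  # running compensation for lost low-order bits
    for v in values:
        y = v - c            # apply the correction to the next addend
        t = total + y        # big + small: low-order bits of y may be lost
        c = (t - total) - y  # recover what was lost, with opposite sign
        total = t
    return total
```

For example, `kahan_sum([1e16, 1.0, 1.0])` recovers `1e16 + 2`, whereas naive left-to-right summation loses both unit addends and returns `1e16`.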

Applying the Welford algorithm with n=100

n = 100
s = (0, 0, 0)
for i in range(len(df.b)):
    if i >= n:
        s = welford_remove(s, df.b[i-n])
    s = welford_add(s, df.b[i])
finalize(s)

It gives

(6.000000000000152, 4.7853099260919405e-12, 4.8336463899918594e-12)

And the df.b.rolling(100).var() gives

0                 NaN
1                 NaN
2                 NaN
3                 NaN
4                 NaN
             ...     
12977    6.206061e-01
12978    4.703030e-01
12979    3.167677e-01
12980    1.600000e-01
12981    6.487273e-12
Name: b, Length: 12982, dtype: float64

This error, 6.5e-12, is slightly higher than the 4.83e-12 given by direct application of Welford's method.

On the other hand, (df.b**2).rolling(n).sum() - df.b.rolling(n).sum()**2/n gives exactly 0.0 for the last entry:

0          NaN
1          NaN
2          NaN
3          NaN
4          NaN
         ...  
12977    61.44
12978    46.56
12979    31.36
12980    15.84
12981     0.00
Name: b, Length: 12982, dtype: float64
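That exact 0.0 is no accident: for a constant window, the two sums in the "textbook" formula cancel exactly in floating point. A small illustration (my sketch, not from the original answer); note that the same formula can lose precision badly when the mean is large relative to the spread:

```python
import numpy as np

def var_textbook(x):
    """Population variance via the one-pass sum-of-squares formula."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return (np.sum(x**2) - np.sum(x)**2 / n) / n

# Constant window: the cancellation is exact, the variance comes out 0.0
print(var_textbook(np.full(100, 6.0)))              # 0.0

# Large mean, tiny spread: true variance is 2/3, but the subtraction
# cancels ~16 significant digits and the answer is lost
print(var_textbook(np.array([1e8, 1e8 + 1, 1e8 + 2])))
print(np.var(np.array([1e8, 1e8 + 1, 1e8 + 2])))    # two-pass: 0.666...
```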

I hope this explanation is satisfactory :)

Answered Sep 26 '22 by Bob