Let me preface this by saying that, in order to reproduce the problem, I need a large dataset, and that is part of the problem: I can't predict when the peculiarity is going to pop up. The data is too large (~13k rows, 2 cols) to be pasted in the question, so I have added a pastebin link at the end of the post.

I have been facing a peculiar problem for the past few days with pandas.core.window.rolling.Rolling.corr. I have a dataset where I am trying to calculate rolling correlations. This is the problem:

While calculating rolling (window_size=100) correlations between two columns (a and b), some indices (one such index is 12981) give near-0 values (of order 1e-10), when they should ideally return nan or inf (because all values in one column are constant). However, if I calculate the standalone correlation pertaining to that index (i.e. the last 100 rows of data, including the said index), or perform the rolling calculation on a smaller number of rows (e.g. 300 or 1000 as opposed to 13k), I get the correct result (i.e. nan or inf).
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.read_csv('sample_corr_data.csv')   # link at the end; columns = ['a', 'b']
>>> df.a.tail(100).value_counts()   # repeated values differ beyond the displayed precision
0.000000 86
-0.000029 3
0.000029 3
-0.000029 2
0.000029 2
-0.000029 2
0.000029 2
Name: a, dtype: int64
>>> df.b.tail(100).value_counts() # all 100 values are the same
6.0 100
Name: b, dtype: int64
>>> df.a.tail(100).corr(df.b.tail(100))
nan # expected, because column 'b' has the same value throughout
# Made sure of this using:
# 1. np.corrcoef, because pandas uses this internally to calculate pearson moments
>>> np.corrcoef(df.a.tail(100), df.b.tail(100))[0, 1]
nan
# 2. using a custom function
>>> def pearson(a, b):
        n = a.size
        num = n*np.nansum(a*b) - np.nansum(a)*np.nansum(b)
        den = (n*np.nansum(a**2) - np.nansum(a)**2) * (n*np.nansum(b**2) - np.nansum(b)**2)
        # nan when the denominator is zero or anything is non-finite
        return num/np.sqrt(den) if den and np.isfinite(den*num) else np.nan
>>> pearson(df.a.tail(100), df.b.tail(100))
nan
>>> df.a.rolling(100).corr(df.b).tail(3)
12979 7.761921e-07
12980 5.460717e-07
12981 2.755881e-10 # This should have been NaN/inf !!
## Furthermore!!
>>> debug = df.tail(300)
>>> debug.a.rolling(100).corr(debug.b).tail(3)
12979 7.761921e-07
12980 5.460717e-07
12981 -inf # Got -inf, fine
dtype: float64
>>> debug = df.tail(3000)
>>> debug.a.rolling(100).corr(debug.b).tail(3)
12979 7.761921e-07
12980 5.460717e-07
12981 inf # Got +inf, still acceptable
dtype: float64
This continues up to 9369 rows:
>>> debug = df.tail(9369)
>>> debug.a.rolling(100).corr(debug.b).tail(3)
12979 7.761921e-07
12980 5.460717e-07
12981 inf
dtype: float64
# then
>>> debug = df.tail(9370)
>>> debug.a.rolling(100).corr(debug.b).tail(3)
12979 7.761921e-07
12980 5.460717e-07
12981 4.719615e-10 # SPOOKY ACTION IN DISTANCE!!!
dtype: float64
>>> debug = df.tail(10000)
>>> debug.a.rolling(100).corr(debug.b).tail(3)
12979 7.761921e-07
12980 5.460717e-07
12981 1.198994e-10 # SPOOKY ACTION IN DISTANCE!!!
dtype: float64
>>> df.a.rolling(100).apply(lambda x: x.corr(df.b.reindex(x.index))).tail(3) # PREDICTABLY, VERY SLOW!
12979 7.761921e-07
12980 5.460717e-07
12981 NaN
Name: a, dtype: float64
# again this checks out using other methods,
>>> df.a.rolling(100).apply(lambda x: np.corrcoef(x, df.b.reindex(x.index))[0, 1]).tail(3)
12979 7.761921e-07
12980 5.460717e-07
12981 NaN
Name: a, dtype: float64
>>> df.a.rolling(100).apply(lambda x: pearson(x, df.b.reindex(x.index))).tail(3)
12979 7.761921e-07
12980 5.460717e-07
12981 NaN
Name: a, dtype: float64
As far as I understand, the result of series.rolling(n).corr(other_series) should match the following:
>>> def rolling_corr(series, other_series, n=100):
        return pd.Series(
            [np.nan]*(n-1) + [series[i-n: i].corr(other_series[i-n: i])
                              for i in range(n, series.size+1)]
        )
>>> rolling_corr(df.a, df.b).tail(3)
12979 7.761921e-07
12980 5.460717e-07
12981 NaN
First I thought this was a floating-point arithmetic issue (because initially, in some cases, I could fix it by rounding column 'a' to 5 decimal places, or by casting to float32), but in that case it would be present irrespective of the number of samples used. So there must be some issue with rolling itself, or at least rolling gives rise to floating-point issues depending on the size of the data. I checked the source code of rolling.corr, but could not find anything that would explain such inconsistencies. And now I am worried about how much existing code is plagued by this issue.

What is the reason behind this? And how do I fix it? If this is happening because pandas prefers speed over accuracy (as suggested here), does that mean I can never reliably use pandas.rolling operations on a large sample? How do I know the size beyond which this inconsistency will appear?
Note: this was tested in multiple environments; different OSes return different values at the said index, but all are finite and near 0.
What if you replace the sums in your pearson formula with rolling sums?
def rolling_pearson(a, b, n):
    a_sum = a.rolling(n).sum()
    b_sum = b.rolling(n).sum()
    ab_sum = (a*b).rolling(n).sum()
    aa_sum = (a**2).rolling(n).sum()
    bb_sum = (b**2).rolling(n).sum()
    num = n*ab_sum - a_sum*b_sum
    den = (n*aa_sum - a_sum**2) * (n*bb_sum - b_sum**2)
    return num / den**0.5
rolling_pearson(df.a, df.b, 100)
...
12977 1.109077e-06
12978 9.555249e-07
12979 7.761921e-07
12980 5.460717e-07
12981 inf
Length: 12982, dtype: float64
To answer this question I needed to check the implementation, because the variance of the last 100 samples of b is indeed zero, and the rolling correlation is computed as a.cov(b) / (a.var() * b.var())**0.5.
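A quick sanity check of that decomposition (a sketch using pandas' own rolling cov and var on the same df; the exact tail values may differ by platform) shows how a tiny but nonzero rolling variance of b turns the result finite:

>>> cov = df.a.rolling(100).cov(df.b)
>>> den = (df.a.rolling(100).var() * df.b.rolling(100).var())**0.5
>>> (cov / den).tail(3)   # last value tiny but finite: b's rolling var is ~6e-12, not 0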
After some searching I found the rolling variance implementation here; the method used is Welford's online algorithm. This algorithm is nice because you can add one sample using only one multiplication (the same cost as the methods with cumulative sums), and the variance then comes out with a single division. Here it is, rewritten in Python.
def welford_add(existingAggregate, newValue):
    # add one sample to the running (count, mean, M2) aggregate
    if pd.isna(newValue):
        return existingAggregate
    (count, mean, M2) = existingAggregate
    count += 1
    delta = newValue - mean
    mean += delta / count
    delta2 = newValue - mean
    M2 += delta * delta2
    return (count, mean, M2)

def welford_remove(existingAggregate, newValue):
    # inverse update: remove one sample from the aggregate
    if pd.isna(newValue):
        return existingAggregate
    (count, mean, M2) = existingAggregate
    count -= 1
    delta = newValue - mean
    mean -= delta / count
    delta2 = newValue - mean
    M2 -= delta * delta2
    return (count, mean, M2)

def finalize(existingAggregate):
    # turn the aggregate into (mean, population variance, sample variance)
    (count, mean, M2) = existingAggregate
    (mean, variance, sampleVariance) = (mean,
                                        M2 / count if count > 0 else None,
                                        M2 / (count - 1) if count > 1 else None)
    return (mean, variance, sampleVariance)
In the pandas implementation they mention Kahan summation; this is important to get better precision in additions, but the results here are not improved by it (I didn't check whether it is properly implemented or not).
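As an aside, here is a minimal sketch of what Kahan (compensated) summation does; this is the textbook form, not pandas' actual code:

def kahan_sum(values):
    total = 0.0
    comp = 0.0                  # running compensation for lost low-order bits
    for v in values:
        y = v - comp            # re-inject the bits lost on the previous add
        t = total + y           # big + small: low-order bits of y can be lost
        comp = (t - total) - y  # algebraically zero; numerically, the lost bits
        total = t
    return total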
Applying the Welford algorithm with n = 100:

n = 100
s = (0, 0, 0)
for i in range(len(df.b)):
    if i >= n:
        s = welford_remove(s, df.b[i-n])
    s = welford_add(s, df.b[i])
finalize(s)
It gives
(6.000000000000152, 4.7853099260919405e-12, 4.8336463899918594e-12)
And df.b.rolling(100).var() gives:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
12977 6.206061e-01
12978 4.703030e-01
12979 3.167677e-01
12980 1.600000e-01
12981 6.487273e-12
Name: b, Length: 12982, dtype: float64
This error of 6.4e-12 is slightly higher than the 4.83e-12 given by direct application of Welford's method.
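For contrast, the true variance of that window is exactly zero: all 100 values of b are exactly 6.0, so a direct two-pass computation over just the window loses nothing (this is why the standalone correlation above came out nan):

>>> df.b.tail(100).var()
0.0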
On the other hand, (df.b**2).rolling(n).sum() - df.b.rolling(n).sum()**2/n gives 0.0 for the last entry:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
12977 61.44
12978 46.56
12979 31.36
12980 15.84
12981 0.00
Name: b, Length: 12982, dtype: float64
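That exact zero is no accident in this particular window: with all 100 values equal to 6.0, both terms are exactly representable in float64 and cancel perfectly. A quick check of the window-local arithmetic (leaving aside any error the running sums may pick up earlier in the series):

>>> b = np.full(100, 6.0)
>>> (b**2).sum() - b.sum()**2 / 100   # 3600.0 - 360000.0/100
0.0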
I hope this explanation is satisfactory :)