I want to calculate the Spearman and/or Pearson Correlation between two columns of a DataFrame, using a rolling window.
I have tried df['corr'] = df['col1'].rolling(P).corr(df['col2'])
(P is the window size)
but i don't seem to be able to define the method. (Adding method='spearman'
as argument produces error:
File "main.py", line 29, in __init__
_df['corr'] = g['col1'].rolling(P).corr(g['col2'], method = corr_function)
File "~\Python36\lib\site-packages\pandas\core\window.py", line 1287, in corr
**kwargs)
File "~\Python36\lib\site-packages\pandas\core\window.py", line 1054, in corr
_get_corr, pairwise=bool(pairwise))
File "~\Python36\lib\site-packages\pandas\core\window.py", line 1866, in _flex_binary_moment
return f(X, Y)
File "~\Python36\lib\site-packages\pandas\core\window.py", line 1051, in _get_corr
return a.cov(b, **kwargs) / (a.std(**kwargs) * b.std(**kwargs))
File "~\Python36\lib\site-packages\pandas\core\window.py", line 1280, in cov
ddof=ddof, **kwargs)
File "~\Python36\lib\site-packages\pandas\core\window.py", line 1020, in cov
_get_cov, pairwise=bool(pairwise))
File "~\Python36\lib\site-packages\pandas\core\window.py", line 1866, in _flex_binary_moment
return f(X, Y)
File "~\Python36\lib\site-packages\pandas\core\window.py", line 1015, in _get_cov
center=self.center).count(**kwargs)
TypeError: count() got an unexpected keyword argument 'method'
To be fair, i wasn't expecting this to work, since reading the documentation, there is no mention that rolling.corr
supports methods...
Any suggestions on how to do this, taking into account that the dataframe is quite big (>10M rows)?
rolling.corr
does Pearson, so you can use it for that. For Spearman, use something like this:
import pandas as pd
from numpy.lib.stride_tricks import as_strided
from numpy.lib import pad
import numpy as np
def rolling_spearman(seqa, seqb, window):
stridea = seqa.strides[0]
ssa = as_strided(seqa, shape=[len(seqa) - window + 1, window], strides=[stridea, stridea])
strideb = seqa.strides[0]
ssb = as_strided(seqb, shape=[len(seqb) - window + 1, window], strides =[strideb, strideb])
ar = pd.DataFrame(ssa)
br = pd.DataFrame(ssb)
ar = ar.rank(1)
br = br.rank(1)
corrs = ar.corrwith(br, 1)
return pad(corrs, (window - 1, 0), 'constant', constant_values=np.nan)
E.g.:
In [144]: df = pd.DataFrame(np.random.randint(0,1000,size=(10,2)), columns = list('ab'))
In [145]: df['corr'] = rolling_spearman(df.a, df.b, 4)
In [146]: df
Out[146]:
a b corr
0 429 922 NaN
1 618 382 NaN
2 476 517 NaN
3 267 536 -0.8
4 582 844 -0.4
5 254 895 -0.4
6 583 974 0.4
7 687 298 -0.4
8 697 447 -0.6
9 383 35 0.4
Explanation: numpy.lib.stride_tricks.as_strided
is a hacky method that in this case gives us a view of the sequences that looks like a 2d array with the rolling window sections of the sequence we're looking at.
From then on, it's simple. Spearman correlation is equivalent to transforming the sequences to ranks, and taking the Pearson correlation coefficient. Pandas has, helpfully, got fast implementations to do this row-wise on DataFrame
s. Then at the end we pad the start of the resulting Series
with NaN values (so you can add it as a column to your dataframe or whatever).
(Personal note: I spent so long trying to figure out how to do this efficiently with numpy and scipy before I realised everything you need is in pandas already...!).
To show the speed advantage of this method over just looping over the sliding windows, I made a little file called srsmall.py
containing:
import pandas as pd
from numpy.lib.stride_tricks import as_strided
import scipy.stats
from numpy.lib import pad
import numpy as np
def rolling_spearman_slow(seqa, seqb, window):
stridea = seqa.strides[0]
ssa = as_strided(seqa, shape=[len(seqa) - window + 1, window], strides=[stridea, stridea])
strideb = seqa.strides[0]
ssb = as_strided(seqb, shape=[len(seqb) - window + 1, window], strides =[strideb, strideb])
corrs = [scipy.stats.spearmanr(a, b)[0] for (a,b) in zip(ssa, ssb)]
return pad(corrs, (window - 1, 0), 'constant', constant_values=np.nan)
def rolling_spearman_quick(seqa, seqb, window):
stridea = seqa.strides[0]
ssa = as_strided(seqa, shape=[len(seqa) - window + 1, window], strides=[stridea, stridea])
strideb = seqa.strides[0]
ssb = as_strided(seqb, shape=[len(seqb) - window + 1, window], strides =[strideb, strideb])
ar = pd.DataFrame(ssa)
br = pd.DataFrame(ssb)
ar = ar.rank(1)
br = br.rank(1)
corrs = ar.corrwith(br, 1)
return pad(corrs, (window - 1, 0), 'constant', constant_values=np.nan)
And compare the performance:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: from srsmall import rolling_spearman_slow as slow
In [4]: from srsmall import rolling_spearman_quick as quick
In [5]: for i in range(6):
...: df = pd.DataFrame(np.random.randint(0,1000,size=(10*(10**i),2)), columns=list('ab'))
...: print len(df), " rows"
...: print "quick: ",
...: %timeit quick(df.a, df.b, 4)
...: print "slow: ",
...: %timeit slow(df.a, df.b, 4)
...:
10 rows
quick: 100 loops, best of 3: 3.52 ms per loop
slow: 100 loops, best of 3: 3.2 ms per loop
100 rows
quick: 100 loops, best of 3: 3.53 ms per loop
slow: 10 loops, best of 3: 42 ms per loop
1000 rows
quick: 100 loops, best of 3: 3.82 ms per loop
slow: 1 loop, best of 3: 430 ms per loop
10000 rows
quick: 100 loops, best of 3: 7.47 ms per loop
slow: 1 loop, best of 3: 4.33 s per loop
100000 rows
quick: 10 loops, best of 3: 50.2 ms per loop
slow: 1 loop, best of 3: 43.4 s per loop
1000000 rows
quick: 1 loop, best of 3: 404 ms per loop
slow:
On a million rows (on my machine), the quick (pandas) version runs in less than half a second. Not shown above but 10 million took 8.43 seconds. The slow one is still running, but assuming the linear growth continues it should take around 7 minutes for 1M and over an hour for 10M.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With