I have a large pandas dataframe (97165 rows and 2 columns) and I'd like to calculate and save the correlation between those columns over every 100-row window. I want something like this:
1st correlation --> rows 0 to 99 --> corr = 0.265
2nd correlation --> rows 1 to 100 --> corr = 0.279
3rd correlation --> rows 2 to 101 --> corr = 0.287
Each value has to be stored and shown in a plot afterwards, so I have to save all these values in a list or something similar.
I have been reading the pandas documentation on rolling windows, but I was not able to achieve anything with it. I tried a simple loop to get some results, but I ran into memory problems. This is the code I tried:
lcl = 100
a = []
for i in range(len(tabla)):
    x = tabla.iloc[i:lcl, [0]]
    y = tabla.iloc[i:lcl, [1]]
    z = x['2015_Avion'].corr(y['2015_Hotel'])
    a.append(z)
    lcl += 1
Any suggestions?
We can optimize on memory and performance by working with array data.
Approach #1
First off, let's have an array-based solution to get the correlation coefficient between two 1D arrays. This is basically inspired by this post and would look something like this -
import numpy as np

def corrcoeff_1d(A, B):
    # Subtract the means from the input arrays themselves
    A_mA = A - A.mean(-1, keepdims=1)
    B_mB = B - B.mean(-1, keepdims=1)

    # Sum of squares
    ssA = np.einsum('i,i->', A_mA, A_mA)
    ssB = np.einsum('i,i->', B_mB, B_mB)

    # Finally get corr coeff
    return np.einsum('i,i->', A_mA, B_mB) / np.sqrt(ssA * ssB)
Now, to use it, run the same loop, but on the array data -
lcl = 100
ar = tabla.values
N = len(ar)
out = np.zeros(N)
for i in range(N):
    out[i] = corrcoeff_1d(ar[i:i+lcl, 0], ar[i:i+lcl, 1])
We can optimize further on performance by pre-computing the rolling mean values used to compute A_mA in corrcoeff_1d with convolution, but first let's get the memory error out of the way.
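To make that idea concrete, here is a minimal sketch (not part of the original answer) of pre-computing the rolling means for every full-length window in one shot with np.convolve; ar and lcl are assumed to be defined as above:

# Sketch: rolling means via convolution with a uniform kernel
kernel = np.ones(lcl) / lcl
# 'valid' mode yields exactly one mean per full-length window, shape (N-lcl+1,)
roll_mean0 = np.convolve(ar[:, 0], kernel, mode='valid')
roll_mean1 = np.convolve(ar[:, 1], kernel, mode='valid')

Each entry of roll_mean0 is the mean that corrcoeff_1d would otherwise recompute for the corresponding window.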
Approach #2
Here's an almost vectorized approach: we vectorize most of the iterations, except for the leftover slices at the end that don't have full window lengths. The loop count is thus reduced from 97165 to lcl-1, i.e. a mere 99.
lcl = 100
ar = tabla.values
N = len(ar)
out = np.zeros(N)

col0_win = strided_app(ar[:, 0], lcl, S=1)
col1_win = strided_app(ar[:, 1], lcl, S=1)
vectorized_out = corr2_coeff_rowwise(col0_win, col1_win)

M = len(vectorized_out)
out[:M] = vectorized_out
for i in range(M, N):
    out[i] = corrcoeff_1d(ar[i:i+lcl, 0], ar[i:i+lcl, 1])
Helper funcs -
# https://stackoverflow.com/a/40085052/ @ Divakar
def strided_app(a, L, S):  # Window len = L, Stride len/stepsize = S
    nrows = ((a.size - L) // S) + 1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows, L), strides=(S*n, n))
# https://stackoverflow.com/a/41703623/ @Divakar
def corr2_coeff_rowwise(A, B):
    # Subtract the row-wise means from the input arrays themselves
    A_mA = A - A.mean(-1, keepdims=1)
    B_mB = B - B.mean(-1, keepdims=1)

    # Sum of squares across rows
    ssA = np.einsum('ij,ij->i', A_mA, A_mA)
    ssB = np.einsum('ij,ij->i', B_mB, B_mB)

    # Finally get corr coeff
    return np.einsum('ij,ij->i', A_mA, B_mB) / np.sqrt(ssA * ssB)
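As a quick sanity check (a sketch on made-up data, not part of the original answer), the vectorized row-wise result can be compared against pandas' built-in rolling correlation; pandas labels each window at its last row, so the first 99 NaN warm-up values are dropped to align the two:

import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((500, 2)), columns=['a', 'b'])

w0 = strided_app(df['a'].to_numpy(), L=100, S=1)
w1 = strided_app(df['b'].to_numpy(), L=100, S=1)
ours = corr2_coeff_rowwise(w0, w1)                        # shape (401,)

ref = df['a'].rolling(100).corr(df['b']).to_numpy()[99:]  # drop NaN warm-up
assert np.allclose(ours, ref)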
Correlation for NaN-filled data
Listed next are NumPy versions of the pandas correlation computation that handle NaNs: one for the scalar correlation between two 1D arrays, and one for row-wise correlation values.
1) Scalar correlation value between two 1D arrays -
def nancorrcoeff_1d(A, B):
    # Get combined mask of positions where both A and B are valid (non-NaN)
    comb_mask = ~(np.isnan(A) | np.isnan(B))
    count = comb_mask.sum()

    # Subtract the means (over the valid positions) from the input arrays
    A_mA = A - np.nansum(A * comb_mask, -1, keepdims=1) / count
    B_mB = B - np.nansum(B * comb_mask, -1, keepdims=1) / count

    # Replace NaNs with zeros, so that the later summations can be computed
    A_mA[~comb_mask] = 0
    B_mB[~comb_mask] = 0

    # Sum of squares
    ssA = np.inner(A_mA, A_mA)
    ssB = np.inner(B_mB, B_mB)

    # Finally get corr coeff
    return np.inner(A_mA, B_mB) / np.sqrt(ssA * ssB)
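For a small illustrative check (with made-up values): pandas' Series.corr also uses only the pairwise-complete observations, so the two should agree -

import pandas as pd

a = np.array([1., 2., np.nan, 4., 5.])
b = np.array([2., np.nan, 6., 8., 11.])

print(nancorrcoeff_1d(a, b))            # uses rows 0, 3 and 4 only
print(pd.Series(a).corr(pd.Series(b)))  # should match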
2) Row-wise correlation between two 2D arrays of shape (m,n), giving us a 1D array of shape (m,) -
def nancorrcoeff_rowwise(A, B):
    # Input : Two 2D arrays of the same shape (m,n). Output : One 1D array (m,)
    # Get combined mask of positions where both A and B are valid (non-NaN)
    comb_mask = ~(np.isnan(A) | np.isnan(B))
    count = comb_mask.sum(axis=-1, keepdims=1)

    # Subtract the row-wise means (over the valid positions) from the input arrays
    A_mA = A - np.nansum(A * comb_mask, -1, keepdims=1) / count
    B_mB = B - np.nansum(B * comb_mask, -1, keepdims=1) / count

    # Replace NaNs with zeros, so that the later summations can be computed
    A_mA[~comb_mask] = 0
    B_mB[~comb_mask] = 0

    # Sum of squares across rows
    ssA = np.einsum('ij,ij->i', A_mA, A_mA)
    ssB = np.einsum('ij,ij->i', B_mB, B_mB)

    # Finally get corr coeff
    return np.einsum('ij,ij->i', A_mA, B_mB) / np.sqrt(ssA * ssB)
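A usage sketch (with fabricated, NaN-sprinkled data), reusing strided_app from above to get NaN-aware sliding-window correlations:

rng = np.random.default_rng(1)
data = rng.standard_normal((1000, 2))
data[rng.random((1000, 2)) < 0.05] = np.nan   # sprinkle in some NaNs

w0 = strided_app(data[:, 0], L=100, S=1)
w1 = strided_app(data[:, 1], L=100, S=1)
nan_corrs = nancorrcoeff_rowwise(w0, w1)      # shape (901,)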
You mentioned trying rolling. What exactly went wrong with that? This works for me:
my_res = tabla['2015_Avion'].rolling(100).corr(tabla['2015_Hotel'])
my_res will have NaN values until its 100th entry, so my_res[99] should be the correlation over rows 0 through 99 of both columns, as would be returned by pandas' corr function applied to just that subset. my_res[100] is the correlation over rows 1 through 100.
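Since the end goal was a plot, the rolling result can be visualized directly (a minimal sketch using matplotlib):

import matplotlib.pyplot as plt

my_res = tabla['2015_Avion'].rolling(100).corr(tabla['2015_Hotel'])
my_res.plot()
plt.ylabel('rolling correlation (window = 100)')
plt.show()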