
Python generate rolling window to calculate correlation

I have a large pandas DataFrame (97165 rows and 2 columns) and I'd like to calculate and save the correlation between those columns over a rolling window of 100 rows. I want something like this:

1st correlation --> rows from 0 to 100 --> corr = 0.265

2nd correlation --> rows from 1 to 101 --> corr = 0.279

3rd correlation --> rows from 2 to 102 --> corr = 0.287

Each value has to be stored and afterwards shown in a plot, so I have to save all these values in a list or something similar.

I have been reading the pandas documentation related to rolling windows (pandas rolling window), but I was not able to achieve anything. I tried a simple loop to get some results, but I ran into memory problems. The code I tried is:

lcl = 100
a = []
for i in range(len(tabla)):
    x = tabla.iloc[i:lcl, [0]]
    y = tabla.iloc[i:lcl, [1]]
    z = x['2015_Avion'].corr(y['2015_Hotel'])
    a.append(z)
    lcl += 1

Any suggestions?

WinterZ


2 Answers

We can optimize on memory and performance by working with array data.

Approach #1

First off, let's have an array-based solution that computes the correlation coefficient between two 1D arrays. It is basically inspired by this post and looks something like this -

import numpy as np

def corrcoeff_1d(A, B):
    # Mean of each input array & subtract from the input arrays themselves
    A_mA = A - A.mean(-1, keepdims=1)
    B_mB = B - B.mean(-1, keepdims=1)

    # Sum of squares
    ssA = np.einsum('i,i->', A_mA, A_mA)
    ssB = np.einsum('i,i->', B_mB, B_mB)

    # Finally get corr coeff
    return np.einsum('i,i->', A_mA, B_mB) / np.sqrt(ssA * ssB)
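
As a quick sanity check (not part of the original answer), corrcoeff_1d should agree with NumPy's built-in np.corrcoef on any pair of equal-length 1D arrays -

a = np.random.rand(100)
b = np.random.rand(100)
print(np.isclose(corrcoeff_1d(a, b), np.corrcoef(a, b)[0, 1]))  # expected: True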

Now, to use it, run the same loop, but on the array data -

lcl = 100
ar = tabla.values
N = len(ar)
out = np.zeros(N)
for i in range(N):
    out[i] = corrcoeff_1d(ar[i:i+lcl,0], ar[i:i+lcl,1])

We can optimize further on performance by pre-computing the rolling mean values (the ones used for A_mA in corrcoeff_1d) with convolution, but first let's get the memory error out of the way.
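
A rough sketch of that pre-computation idea (the function name is illustrative, not from the original answer) -

def rolling_mean_1d(a, w):
    # Uniform kernel of length w; mode='valid' keeps only the full windows,
    # so rolling_mean_1d(ar[:, 0], lcl)[i] equals ar[i:i+lcl, 0].mean()
    return np.convolve(a, np.ones(w) / w, mode='valid')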

Approach #2

Here's an almost vectorized approach: we vectorize most of the iterations, except for the leftover slices at the end that don't have full window lengths. The loop count drops from 97165 to lcl-1, i.e. a mere 99.

lcl = 100
ar = tabla.values
N = len(ar)
out = np.zeros(N)

col0_win = strided_app(ar[:,0],lcl,S=1)
col1_win = strided_app(ar[:,1],lcl,S=1)
vectorized_out = corr2_coeff_rowwise(col0_win, col1_win)
M = len(vectorized_out)
out[:M] = vectorized_out

for i in range(M,N):
    out[i] = corrcoeff_1d(ar[i:i+lcl,0], ar[i:i+lcl,1])

Helper funcs -

# https://stackoverflow.com/a/40085052/ @ Divakar
def strided_app(a, L, S):  # Window len = L, Stride len/stepsize = S
    nrows = ((a.size-L)//S)+1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows,L), strides=(S*n,n))

# https://stackoverflow.com/a/41703623/ @Divakar
def corr2_coeff_rowwise(A,B):
    # Rowwise mean of input arrays & subtract from input arrays themselves
    A_mA = A - A.mean(-1,keepdims=1)
    B_mB = B - B.mean(-1,keepdims=1)

    # Sum of squares across rows
    ssA = np.einsum('ij,ij->i',A_mA, A_mA)
    ssB = np.einsum('ij,ij->i',B_mB, B_mB)

    # Finally get corr coeff
    return np.einsum('ij,ij->i',A_mA,B_mB)/np.sqrt(ssA*ssB)
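
As a sanity check on Approach #2 (assuming tabla has the two columns named in the question), the fully windowed part of out should match pandas' built-in rolling correlation, shifted so that each value is indexed by the start of its window -

pd_corr = tabla['2015_Avion'].rolling(lcl).corr(tabla['2015_Hotel'])
print(np.allclose(out[:M], pd_corr.values[lcl - 1:]))  # should print True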

Correlation for NaN-filled data

Listed next are NumPy equivalents of the pandas-based correlation computation for NaN-filled data: one for the correlation between two 1D arrays, and one for row-wise correlation between two 2D arrays.

1) Scalar correlation value between two 1D arrays -

def nancorrcoeff_1d(A, B):
    # Get combined mask of positions that are valid (non-NaN) in both arrays
    comb_mask = ~(np.isnan(A) | np.isnan(B))
    count = comb_mask.sum()

    # Mean over the valid positions & subtract from the input arrays themselves
    A_mA = A - np.nansum(A * comb_mask, -1, keepdims=1) / count
    B_mB = B - np.nansum(B * comb_mask, -1, keepdims=1) / count

    # Replace NaNs with zeros, so that later summations can be computed
    A_mA[~comb_mask] = 0
    B_mB[~comb_mask] = 0

    # Sum of squares
    ssA = np.inner(A_mA, A_mA)
    ssB = np.inner(B_mB, B_mB)

    # Finally get corr coeff
    return np.inner(A_mA, B_mB) / np.sqrt(ssA * ssB)
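
A quick check (illustrative only, not from the original answer): pandas' Series.corr also skips incomplete pairs, so it should return the same value -

import pandas as pd

a = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
b = np.array([2.0, np.nan, 6.0, 8.0, 11.0])
print(nancorrcoeff_1d(a, b))            # computed on the 3 complete pairs
print(pd.Series(a).corr(pd.Series(b)))  # same value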

2) Row-wise correlation between two 2D arrays (m,n) to give us a 1D array of shape (m,) -

def nancorrcoeff_rowwise(A, B):
    # Input : Two 2D arrays of same shape (m,n). Output : One 1D array (m,)
    # Get combined mask of positions that are valid (non-NaN) in both arrays
    comb_mask = ~(np.isnan(A) | np.isnan(B))
    count = comb_mask.sum(axis=-1, keepdims=1)

    # Rowwise mean over the valid positions & subtract from the input arrays themselves
    A_mA = A - np.nansum(A * comb_mask, -1, keepdims=1) / count
    B_mB = B - np.nansum(B * comb_mask, -1, keepdims=1) / count

    # Replace NaNs with zeros, so that later summations can be computed
    A_mA[~comb_mask] = 0
    B_mB[~comb_mask] = 0

    # Sum of squares across rows
    ssA = np.einsum('ij,ij->i', A_mA, A_mA)
    ssB = np.einsum('ij,ij->i', B_mB, B_mB)

    # Finally get corr coeff
    return np.einsum('ij,ij->i', A_mA, B_mB) / np.sqrt(ssA * ssB)
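
If the columns do contain NaNs, these routines can be dropped into the Approach #2 skeleton in place of the non-NaN versions; a sketch (variable names as before, out_nan is illustrative) -

col0_win = strided_app(ar[:, 0], lcl, S=1)
col1_win = strided_app(ar[:, 1], lcl, S=1)
M = len(col0_win)                   # number of full-length windows
out_nan = np.empty(N)
out_nan[:M] = nancorrcoeff_rowwise(col0_win, col1_win)
for i in range(M, N):               # leftover short windows at the end
    out_nan[i] = nancorrcoeff_1d(ar[i:i+lcl, 0], ar[i:i+lcl, 1])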
Divakar


You mentioned trying rolling. What exactly went wrong with that? This works for me:

my_res = tabla['2015_Avion'].rolling(100).corr(tabla['2015_Hotel'])

my_res will have NaN values until its 100th element, so my_res[99] is the correlation over rows 0 through 99 of both columns (the same value pandas' corr would return if applied to just that subset), and my_res[100] is the correlation over rows 1 through 100.
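
Since the values also need to be shown in a plot, a minimal sketch with matplotlib (assuming it is installed) -

import matplotlib.pyplot as plt

my_res.plot()                                    # pandas Series line plot
plt.ylabel('rolling correlation (window = 100)')
plt.show()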

Antoine Zambelli