I have two matrices in which the columns are variables and the rows are samples; both matrices have the same number of samples. One matrix is 800 by 200 and the other is 800 by 100000. I want to compute the correlation matrix between the columns of these two matrices, so I tried this:
import numpy as np
def matcor(x, y):
    xc = x.shape[1]
    return np.corrcoef(x, y, rowvar=False)[xc:, :xc]
xy_cor = matcor(X, Y)
However, this takes an enormous amount of memory: I get a MemoryError at around 64 GB used, and the full computation might need even more. Is there a memory-efficient way to compute this?
Unfortunately, the cov and corrcoef functions don't allow computing only the xy block of the correlation matrix. np.corrcoef(x, y, rowvar=False) stacks the two inputs and builds the full 100200 x 100200 correlation matrix, which by itself needs roughly 80 GB in float64, so computing the full matrix and extracting the slice afterwards (which is what your code does) is infeasible at this size. Instead, compute the xy part by hand:
samples = x.shape[0]
# center each column (subtract the column mean)
centered_x = x - np.sum(x, axis=0, keepdims=True) / samples
centered_y = y - np.sum(y, axis=0, keepdims=True) / samples
# cross-covariance between columns of x and columns of y;
# this (200, 100000) array is the only large intermediate
cov_xy = 1. / (samples - 1) * np.dot(centered_x.T, centered_y)
# per-column variances, needed to turn covariances into correlations
var_x = 1. / (samples - 1) * np.sum(centered_x**2, axis=0)
var_y = 1. / (samples - 1) * np.sum(centered_y**2, axis=0)
corrcoef_xy = cov_xy / np.sqrt(var_x[:, None] * var_y[None, :])
You need the variances to normalize the covariances into correlations; if you only wanted the cross-covariance matrix, the first four lines would suffice.
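As a sanity check, on small inputs (chosen hypothetically here, where the full stacked matrix still fits in memory) the hand-rolled result can be compared against the corresponding block of np.corrcoef. The helper name corr_xy is my own; it just wraps the steps above:

```python
import numpy as np

def corr_xy(x, y):
    """Correlations between columns of x and columns of y, computed
    without forming the full stacked correlation matrix."""
    samples = x.shape[0]
    cx = x - x.mean(axis=0, keepdims=True)
    cy = y - y.mean(axis=0, keepdims=True)
    cov_xy = cx.T @ cy / (samples - 1)
    var_x = (cx**2).sum(axis=0) / (samples - 1)
    var_y = (cy**2).sum(axis=0) / (samples - 1)
    return cov_xy / np.sqrt(var_x[:, None] * var_y[None, :])

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 4))
y = rng.standard_normal((50, 6))

direct = corr_xy(x, y)                          # shape (4, 6)
# xy block of the full matrix: first 4 columns are x, last 6 are y
full = np.corrcoef(x, y, rowvar=False)[:4, 4:]
print(np.allclose(direct, full))                # True
```

Note that this returns the block with x-columns as rows and y-columns as columns, i.e. the transpose of what the matcor slice in the question returns; take whichever orientation you prefer.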