I have two matrices in which the columns are variables and the rows are samples; both matrices have the same number of samples. One matrix is 800 by 200 and the other is 800 by 100000. I want to compute the correlation matrix between the columns of these two matrices, so I tried this:
import numpy as np
def matcor(x, y):
    xc = x.shape[1]
    return np.corrcoef(x, y, rowvar=False)[xc:, :xc]
xy_cor = matcor(X, Y)
However, this takes an enormous amount of memory: I get a MemoryError at around 64 GB used, and the full computation might need even more. Is there a memory-efficient way to compute this?
Unfortunately, the cov and corrcoef functions don't allow computing only the xy block of the correlation matrix. np.corrcoef(x, y, rowvar=False) stacks the two inputs and builds the full 100200 x 100200 correlation matrix, which by itself needs roughly 80 GB in float64, so computing the full matrix and extracting the slice afterwards (which is what your code does) is infeasible at this size. Instead, compute the xy part by hand:
samples = x.shape[0]
# center each column (subtract the column mean)
centered_x = x - np.sum(x, axis=0, keepdims=True) / samples
centered_y = y - np.sum(y, axis=0, keepdims=True) / samples
# cross-covariance between columns of x and columns of y;
# this (200, 100000) array is the only large intermediate
cov_xy = 1. / (samples - 1) * np.dot(centered_x.T, centered_y)
# per-column variances, needed to turn covariances into correlations
var_x = 1. / (samples - 1) * np.sum(centered_x**2, axis=0)
var_y = 1. / (samples - 1) * np.sum(centered_y**2, axis=0)
corrcoef_xy = cov_xy / np.sqrt(var_x[:, None] * var_y[None, :])
You need the variances to normalize the covariances into correlations; if you only wanted the cross-covariance matrix, the first four lines would suffice.
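As a sanity check, on small inputs (chosen hypothetically here, where the full stacked matrix still fits in memory) the hand-rolled result can be compared against the corresponding block of np.corrcoef. The helper name corr_xy is my own; it just wraps the steps above:

```python
import numpy as np

def corr_xy(x, y):
    """Correlations between columns of x and columns of y, computed
    without forming the full stacked correlation matrix."""
    samples = x.shape[0]
    cx = x - x.mean(axis=0, keepdims=True)
    cy = y - y.mean(axis=0, keepdims=True)
    cov_xy = cx.T @ cy / (samples - 1)
    var_x = (cx**2).sum(axis=0) / (samples - 1)
    var_y = (cy**2).sum(axis=0) / (samples - 1)
    return cov_xy / np.sqrt(var_x[:, None] * var_y[None, :])

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 4))
y = rng.standard_normal((50, 6))

direct = corr_xy(x, y)                          # shape (4, 6)
# xy block of the full matrix: first 4 columns are x, last 6 are y
full = np.corrcoef(x, y, rowvar=False)[:4, 4:]
print(np.allclose(direct, full))                # True
```

Note that this returns the block with x-columns as rows and y-columns as columns, i.e. the transpose of what the matcor slice in the question returns; take whichever orientation you prefer.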