Does anyone know how to compute a correlation matrix from a very large sparse matrix in Python? Basically, I am looking for something like numpy.corrcoef that will work on a scipy sparse matrix.
The Pearson correlation coefficient can be computed in Python using the corrcoef() method from NumPy. The input for this function is typically a matrix, say of size m x n, where by default each row represents the values of one random variable and each column represents a single observation of all m variables.
Correlation coefficients quantify the association between variables or features of a dataset. These statistics are of high importance for science and technology, and Python has great tools that you can use to calculate them. SciPy, NumPy, and Pandas correlation methods are fast, comprehensive, and well-documented.
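For the dense case, a minimal sketch of how numpy.corrcoef is called (the numbers and names here are made up purely for illustration):

import numpy as np

# Five observations of two variables
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.1, 7.9, 10.0])

R = np.corrcoef(x, y)   # 2 x 2 matrix: ones on the diagonal, r_xy off it
print(R[0, 1])          # close to 1.0, since y is nearly a multiple of x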
You can compute the correlation coefficients fairly straightforwardly from the covariance matrix like this:
import numpy as np
from scipy import sparse

def sparse_corrcoef(A, B=None):
    # Rows are variables, columns are observations (numpy.corrcoef's default)
    if B is not None:
        A = sparse.vstack((A, B), format='csr')

    A = A.astype(np.float64)
    n = A.shape[1]

    # Compute the covariance matrix
    rowsum = A.sum(1)
    centering = rowsum.dot(rowsum.T.conjugate()) / n
    C = (A.dot(A.T.conjugate()) - centering) / (n - 1)

    # The correlation coefficients are given by
    # C_{i,j} / sqrt(C_{i,i} * C_{j,j})
    d = np.diag(C)
    coeffs = C / np.sqrt(np.outer(d, d))

    return coeffs
Check that it works OK:
# some smallish sparse random matrices
a = sparse.rand(100, 100000, density=0.1, format='csr')
b = sparse.rand(100, 100000, density=0.1, format='csr')
coeffs1 = sparse_corrcoef(a, b)
coeffs2 = np.corrcoef(a.todense(), b.todense())
print(np.allclose(coeffs1, coeffs2))
# True
The amount of memory required for computing the covariance matrix C will depend heavily on the sparsity structure of A (and B, if given). For example, if A is an (m, n) matrix containing just a single column of non-zero values, then C will be an (m, m) matrix containing all non-zero values. If m is large, this could be very bad news in terms of memory consumption.
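To make that concrete, here is a small illustration (the sizes are arbitrary): a single fully populated column is enough to make the Gram matrix A * A.T, and hence C, completely dense.

import numpy as np
from scipy import sparse

m = 1000
col = sparse.csr_matrix(np.ones((m, 1)))   # one column of all non-zero values
zeros = sparse.csr_matrix((m, 9))          # the rest of the matrix is empty
A = sparse.hstack([col, zeros], format='csr')

G = A @ A.T                  # the Gram matrix used inside the covariance
print(G.shape, G.nnz)        # (1000, 1000) 1000000 -- every entry is non-zero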
You do not need to introduce a large dense matrix. Just keep everything sparse until the final, small result, using scipy.sparse:

import numpy as np
from scipy import sparse

def sparse_corr(A):
    # A: sparse matrix with samples as rows and variables as columns
    N = A.shape[0]
    # Column sums as a flat array
    col_sums = np.asarray(A.sum(axis=0)).ravel()
    # Covariance matrix; A.T @ A is only n x n, so densifying it is cheap
    C = (np.asarray((A.T @ A).todense()) - np.outer(col_sums, col_sums) / N) / (N - 1)
    # Outer product of the standard deviations
    d = np.diag(C)
    V = np.sqrt(np.outer(d, d))
    # The tiny constant guards against division by zero for zero-variance columns
    return C / (V + 1e-119)

A = sparse.rand(1000000, 100, density=0.1, format='csr')
sparse_corr(A)
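A quick way to sanity-check sparse_corr is to compare it against the dense computation on a small random matrix; note that variables are the columns here, so np.corrcoef needs rowvar=False:

# Compare against the dense result on a matrix small enough to densify
small = sparse.rand(500, 20, density=0.3, format='csr')
dense_ref = np.corrcoef(small.toarray(), rowvar=False)
print(np.allclose(sparse_corr(small), dense_ref))
# Expected: True (up to floating-point error)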
I present an answer for a scipy sparse matrix which runs in parallel. Rather than returning a giant correlation matrix, this returns a feature mask of fields to keep after checking all fields for both positive and negative Pearson correlations.
I also try to minimize calculations using the following strategy:

- Process each column, comparing it only against the columns to its right (the upper triangle of the correlation matrix).
- Stop checking a column and mark it for removal as soon as the first correlation at or above the threshold is found.

This might be sped up further by keeping a global list of columns marked for removal and skipping further correlation calculations for such columns, since columns will execute out of order. However, I do not know enough about race conditions in Python to implement this tonight; a hedged sketch of the idea appears after the usage example below.
Returning a column mask will obviously allow the code to handle much larger datasets than returning the entire correlation matrix.
Check each column using this function:
from scipy.stats import pearsonr

def get_corr_row(idx_num, sp_mat, thresh):
    # Slice the column at idx_num
    cols = sp_mat.shape[1]
    x = sp_mat[:, idx_num].toarray().ravel()
    start = idx_num + 1
    # Now slice each column to the right of idx_num
    for i in range(start, cols):
        y = sp_mat[:, i].toarray().ravel()
        # Check the Pearson correlation
        corr, pVal = pearsonr(x, y)
        # Pearson ranges from -1 to 1, so abs(corr) catches both
        # positive and negative correlations >= thresh
        if abs(corr) >= thresh:
            # Stop checking: the column at idx_num is marked for
            # removal (False in the keep-mask)
            return False
    # No correlation >= thresh was found: keep the column at idx_num
    return True
Run the column-level correlation checks in parallel:
from joblib import Parallel, delayed
import multiprocessing

def Get_Corr_Mask(sp_mat, thresh, n_jobs=-1):
    # We must make sure the matrix is in CSC format
    # before we start doing all these column slices!
    sp_mat = sp_mat.tocsc()
    cols = sp_mat.shape[1]

    if n_jobs == -1:
        # Process the work on all available CPU cores
        num_cores = multiprocessing.cpu_count()
    else:
        # Process the work on the specified number of CPU cores
        num_cores = n_jobs

    # Return a mask of all columns to keep by calling get_corr_row()
    # once for each column in the matrix
    return Parallel(n_jobs=num_cores, verbose=5)(
        delayed(get_corr_row)(i, sp_mat, thresh) for i in range(cols))
General Usage:
# Get the mask using your sparse matrix and threshold
# (wrapping in np.array ensures boolean-mask indexing, not integer indexing)
corr_mask = np.array(Get_Corr_Mask(X_t_fpr, 0.95))

# Keep only the features whose absolute pairwise correlation stayed below 0.95
X_t_fpr_corr = X_t_fpr[:, corr_mask]
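The global drop-list above is left as an idea, but here is a hedged, untested sketch of what it might look like, using a multiprocessing.Manager dict as a shared set of dropped columns. The names get_corr_row_shared and Get_Corr_Mask_Shared are hypothetical, and note that skipping already-dropped columns makes the surviving feature set depend on worker scheduling order:

import multiprocessing
from joblib import Parallel, delayed
from scipy.stats import pearsonr

def get_corr_row_shared(idx_num, sp_mat, thresh, dropped):
    # Hypothetical variant of get_corr_row() that consults a shared drop-list
    if idx_num in dropped:
        # Another worker already marked this column; no work to do
        return False
    x = sp_mat[:, idx_num].toarray().ravel()
    for i in range(idx_num + 1, sp_mat.shape[1]):
        if i in dropped:
            # Skip checks against columns already marked for removal
            continue
        y = sp_mat[:, i].toarray().ravel()
        corr, pVal = pearsonr(x, y)
        if abs(corr) >= thresh:
            dropped[idx_num] = True   # shared dict used as a set
            return False
    return True

def Get_Corr_Mask_Shared(sp_mat, thresh, n_jobs=-1):
    sp_mat = sp_mat.tocsc()
    manager = multiprocessing.Manager()
    dropped = manager.dict()
    return Parallel(n_jobs=n_jobs, verbose=5)(
        delayed(get_corr_row_shared)(i, sp_mat, thresh, dropped)
        for i in range(sp_mat.shape[1]))

If pickling the manager proxy gives trouble under joblib's default process-based backend, passing backend='threading' to Parallel is a simpler (if slower) way to share the dict safely.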