Does anyone know how to compute a correlation matrix from a very large sparse matrix in Python? Basically, I am looking for something like numpy.corrcoef that will work on a scipy sparse matrix.
The Pearson correlation coefficient can be computed in Python using the corrcoef() method from NumPy. The input for this function is typically a matrix, say of size m x n, where by default each row represents the values of one random variable and each column represents a single observation of all m variables.
Correlation coefficients quantify the association between variables or features of a dataset. These statistics are of high importance for science and technology, and Python has great tools that you can use to calculate them. SciPy, NumPy, and Pandas correlation methods are fast, comprehensive, and well-documented.
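For the dense case, a minimal sketch of how numpy.corrcoef is called (the numbers and names here are made up purely for illustration):

import numpy as np

# Five observations of two variables
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.1, 7.9, 10.0])

R = np.corrcoef(x, y)   # 2 x 2 matrix: ones on the diagonal, r_xy off it
print(R[0, 1])          # close to 1.0, since y is nearly a multiple of x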
You can compute the correlation coefficients fairly straightforwardly from the covariance matrix like this:
import numpy as np
from scipy import sparse

def sparse_corrcoef(A, B=None):
    # Rows are variables, columns are observations (numpy.corrcoef's default)
    if B is not None:
        A = sparse.vstack((A, B), format='csr')

    A = A.astype(np.float64)
    n = A.shape[1]

    # Compute the covariance matrix
    rowsum = A.sum(1)
    centering = rowsum.dot(rowsum.T.conjugate()) / n
    C = (A.dot(A.T.conjugate()) - centering) / (n - 1)

    # The correlation coefficients are given by
    # C_{i,j} / sqrt(C_{i,i} * C_{j,j})
    d = np.diag(C)
    coeffs = C / np.sqrt(np.outer(d, d))

    return coeffs
Check that it works OK:
# some smallish sparse random matrices
a = sparse.rand(100, 100000, density=0.1, format='csr')
b = sparse.rand(100, 100000, density=0.1, format='csr')
coeffs1 = sparse_corrcoef(a, b)
coeffs2 = np.corrcoef(a.todense(), b.todense())
print(np.allclose(coeffs1, coeffs2))
# True
The amount of memory required for computing the covariance matrix C will depend heavily on the sparsity structure of A (and B, if given). For example, if A is an (m, n) matrix containing just a single column of non-zero values, then C will be an (m, m) matrix containing all non-zero values. If m is large, this could be very bad news in terms of memory consumption.
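To make that concrete, here is a small illustration (the sizes are arbitrary): a single fully populated column is enough to make the Gram matrix A * A.T, and hence C, completely dense.

import numpy as np
from scipy import sparse

m = 1000
col = sparse.csr_matrix(np.ones((m, 1)))   # one column of all non-zero values
zeros = sparse.csr_matrix((m, 9))          # the rest of the matrix is empty
A = sparse.hstack([col, zeros], format='csr')

G = A @ A.T                  # the Gram matrix used inside the covariance
print(G.shape, G.nnz)        # (1000, 1000) 1000000 -- every entry is non-zero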
You do not need to introduce a large dense matrix. Just keep everything sparse until the final, small result, using scipy.sparse:

import numpy as np
from scipy import sparse

def sparse_corr(A):
    # A: sparse matrix with samples as rows and variables as columns
    N = A.shape[0]
    # Column sums as a flat array
    col_sums = np.asarray(A.sum(axis=0)).ravel()
    # Covariance matrix; A.T @ A is only n x n, so densifying it is cheap
    C = (np.asarray((A.T @ A).todense()) - np.outer(col_sums, col_sums) / N) / (N - 1)
    # Outer product of the standard deviations
    d = np.diag(C)
    V = np.sqrt(np.outer(d, d))
    # The tiny constant guards against division by zero for zero-variance columns
    return C / (V + 1e-119)

A = sparse.rand(1000000, 100, density=0.1, format='csr')
sparse_corr(A)
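A quick way to sanity-check sparse_corr is to compare it against the dense computation on a small random matrix; note that variables are the columns here, so np.corrcoef needs rowvar=False:

# Compare against the dense result on a matrix small enough to densify
small = sparse.rand(500, 20, density=0.3, format='csr')
dense_ref = np.corrcoef(small.toarray(), rowvar=False)
print(np.allclose(sparse_corr(small), dense_ref))
# Expected: True (up to floating-point error)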
I present an answer for a scipy sparse matrix which runs in parallel. Rather than returning a giant correlation matrix, this returns a feature mask of fields to keep after checking all fields for both positive and negative Pearson correlations.
I also try to minimize calculations using the following strategy:

- Process each column, comparing it only against the columns to its right (the upper triangle of the correlation matrix).
- Stop checking a column and mark it for removal as soon as the first correlation at or above the threshold is found.

This might be sped up further by keeping a global list of columns marked for removal and skipping further correlation calculations for such columns, since columns will execute out of order. However, I do not know enough about race conditions in Python to implement this tonight; a hedged sketch of the idea appears after the usage example below.
Returning a column mask will obviously allow the code to handle much larger datasets than returning the entire correlation matrix.
Check each column using this function:
from scipy.stats import pearsonr

def get_corr_row(idx_num, sp_mat, thresh):
    # Slice the column at idx_num
    cols = sp_mat.shape[1]
    x = sp_mat[:, idx_num].toarray().ravel()
    start = idx_num + 1
    # Now slice each column to the right of idx_num
    for i in range(start, cols):
        y = sp_mat[:, i].toarray().ravel()
        # Check the Pearson correlation
        corr, pVal = pearsonr(x, y)
        # Pearson ranges from -1 to 1, so abs(corr) catches both
        # positive and negative correlations >= thresh
        if abs(corr) >= thresh:
            # Stop checking: the column at idx_num is marked for
            # removal (False in the keep-mask)
            return False
    # No correlation >= thresh was found: keep the column at idx_num
    return True
Run the column-level correlation checks in parallel:
from joblib import Parallel, delayed
import multiprocessing

def Get_Corr_Mask(sp_mat, thresh, n_jobs=-1):
    # We must make sure the matrix is in CSC format
    # before we start doing all these column slices!
    sp_mat = sp_mat.tocsc()
    cols = sp_mat.shape[1]

    if n_jobs == -1:
        # Process the work on all available CPU cores
        num_cores = multiprocessing.cpu_count()
    else:
        # Process the work on the specified number of CPU cores
        num_cores = n_jobs

    # Return a mask of all columns to keep by calling get_corr_row()
    # once for each column in the matrix
    return Parallel(n_jobs=num_cores, verbose=5)(
        delayed(get_corr_row)(i, sp_mat, thresh) for i in range(cols))
General Usage:
# Get the mask using your sparse matrix and threshold
# (wrapping in np.array ensures boolean-mask indexing, not integer indexing)
corr_mask = np.array(Get_Corr_Mask(X_t_fpr, 0.95))

# Keep only the features whose absolute pairwise correlation stayed below 0.95
X_t_fpr_corr = X_t_fpr[:, corr_mask]
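The global drop-list above is left as an idea, but here is a hedged, untested sketch of what it might look like, using a multiprocessing.Manager dict as a shared set of dropped columns. The names get_corr_row_shared and Get_Corr_Mask_Shared are hypothetical, and note that skipping already-dropped columns makes the surviving feature set depend on worker scheduling order:

import multiprocessing
from joblib import Parallel, delayed
from scipy.stats import pearsonr

def get_corr_row_shared(idx_num, sp_mat, thresh, dropped):
    # Hypothetical variant of get_corr_row() that consults a shared drop-list
    if idx_num in dropped:
        # Another worker already marked this column; no work to do
        return False
    x = sp_mat[:, idx_num].toarray().ravel()
    for i in range(idx_num + 1, sp_mat.shape[1]):
        if i in dropped:
            # Skip checks against columns already marked for removal
            continue
        y = sp_mat[:, i].toarray().ravel()
        corr, pVal = pearsonr(x, y)
        if abs(corr) >= thresh:
            dropped[idx_num] = True   # shared dict used as a set
            return False
    return True

def Get_Corr_Mask_Shared(sp_mat, thresh, n_jobs=-1):
    sp_mat = sp_mat.tocsc()
    manager = multiprocessing.Manager()
    dropped = manager.dict()
    return Parallel(n_jobs=n_jobs, verbose=5)(
        delayed(get_corr_row_shared)(i, sp_mat, thresh, dropped)
        for i in range(sp_mat.shape[1]))

If pickling the manager proxy gives trouble under joblib's default process-based backend, passing backend='threading' to Parallel is a simpler (if slower) way to share the dict safely.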