In Python I need to find the pairwise correlation between all features in a matrix A and all features in a matrix B. In particular, I am interested in finding the strongest Pearson correlation that a given feature in A has across all features in B. I do not care whether the strongest correlation is positive or negative.

I've done an inefficient implementation using two loops and scipy below. However, I'd like to use np.corrcoef or another similar method to compute it efficiently. Matrix A has shape 40000x400 and B has shape 40000x1440. My attempt at doing it efficiently can be seen below as the method find_max_absolute_corr(A, B). However, it fails with the following error:

ValueError: all the input array dimensions except for the concatenation axis must match exactly
import numpy as np
from scipy.stats import pearsonr

def find_max_absolute_corr(A, B):
    """ Finds for each feature in `A` the highest Pearson
        correlation across all features in `B`. """
    max_corr_A = np.zeros((A.shape[1]))

    for A_col in range(A.shape[1]):
        print("Calculating {}/{}.".format(A_col + 1, A.shape[1]))
        metric = A[:, A_col]
        pearson = np.corrcoef(B, metric, rowvar=0)
        # takes negative correlations into account as well
        min_p = min(pearson)
        max_p = max(pearson)
        max_corr_A[A_col] = max_absolute(min_p, max_p)
    return max_corr_A

def max_absolute(min_p, max_p):
    if np.isnan(min_p) or np.isnan(max_p):
        raise ValueError("NaN correlation.")
    if abs(max_p) > abs(min_p):
        return max_p
    else:
        return min_p
if __name__ == '__main__':
    A = np.array(
        [[10, 8.04, 9.14, 7.46],
         [8, 6.95, 8.14, 6.77],
         [13, 7.58, 8.74, 12.74],
         [9, 8.81, 8.77, 7.11],
         [11, 8.33, 9.26, 7.81]])

    B = np.array(
        [[-14, -9.96, 8.10, 8.84, 8, 7.04],
         [-6, -7.24, 6.13, 6.08, 5, 5.25],
         [-4, -4.26, 3.10, 5.39, 8, 5.56],
         [-12, -10.84, 9.13, 8.15, 5, 7.91],
         [-7, -4.82, 7.26, 6.42, 8, 6.89]])

    # simple, inefficient method
    for A_col in range(A.shape[1]):
        high_corr = 0
        for B_col in range(B.shape[1]):
            corr, _ = pearsonr(A[:, A_col], B[:, B_col])
            high_corr = max_absolute(high_corr, corr)
        print(high_corr)
    # -0.161314601631
    # 0.956781516149
    # 0.621071009239
    # -0.421539304112

    # efficient method
    max_corr_A = find_max_absolute_corr(A, B)
    print(max_corr_A)
    # [-0.161314601631,
    #  0.956781516149,
    #  0.621071009239,
    #  -0.421539304112]
Seems scipy.stats.pearsonr follows this definition of the Pearson correlation coefficient, applied on column-wise pairs from A & B -

r = (N·Σxy − Σx·Σy) / sqrt( (N·Σx² − (Σx)²) · (N·Σy² − (Σy)²) )
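As a quick sanity check (a minimal sketch, assuming the sample A and B from the question are in scope), the summation form above matches scipy.stats.pearsonr for a single column pair:

import numpy as np
from scipy.stats import pearsonr

x, y = A[:, 0], B[:, 0]   # one column pair from the question's sample data
N = x.shape[0]

# numerator and denominator of the summation form above
num = N * (x * y).sum() - x.sum() * y.sum()
den = np.sqrt((N * (x**2).sum() - x.sum()**2) *
              (N * (y**2).sum() - y.sum()**2))

print(num / den)          # formula result
print(pearsonr(x, y)[0])  # matches scipy's value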
Based on that formula, this can be vectorized easily, as the pairwise computations of columns from A and B are independent of each other. Here's one vectorized solution using broadcasting -
# Get number of rows in either A or B
N = B.shape[0]
# Store column-wise sums of A and B, as they would be used in a few places
sA = A.sum(0)
sB = B.sum(0)
# Basically there are four parts in the formula. We would compute them one-by-one
p1 = N*np.einsum('ij,ik->kj',A,B)
p2 = sA*sB[:,None]
p3 = N*((B**2).sum(0)) - (sB**2)
p4 = N*((A**2).sum(0)) - (sA**2)
# Finally compute Pearson Correlation Coefficient as 2D array
pcorr = ((p1 - p2)/np.sqrt(p4*p3[:,None]))
# Get the element corresponding to absolute argmax along the columns
out = pcorr[np.nanargmax(np.abs(pcorr),axis=0),np.arange(pcorr.shape[1])]
Sample run -
1) Inputs :
In [12]: A
Out[12]:
array([[ 10. , 8.04, 9.14, 7.46],
[ 8. , 6.95, 8.14, 6.77],
[ 13. , 7.58, 8.74, 12.74],
[ 9. , 8.81, 8.77, 7.11],
[ 11. , 8.33, 9.26, 7.81]])
In [13]: B
Out[13]:
array([[-14. , -9.96, 8.1 , 8.84, 8. , 7.04],
[ -6. , -7.24, 6.13, 6.08, 5. , 5.25],
[ -4. , -4.26, 3.1 , 5.39, 8. , 5.56],
[-12. , -10.84, 9.13, 8.15, 5. , 7.91],
[ -7. , -4.82, 7.26, 6.42, 8. , 6.89]])
2) Original loopy code run -
In [14]: high_corr_out = np.zeros(A.shape[1])
...: for A_col in range(A.shape[1]):
...: high_corr = 0
...: for B_col in range(B.shape[1]):
...: corr,_ = pearsonr(A[:,A_col], B[:,B_col])
...: high_corr = max_absolute(high_corr, corr)
...: high_corr_out[A_col] = high_corr
...:
In [15]: high_corr_out
Out[15]: array([ 0.8067843 , 0.95678152, 0.74016181, -0.85127779])
3) Proposed code run -
In [16]: N = B.shape[0]
...: sA = A.sum(0)
...: sB = B.sum(0)
...: p1 = N*np.einsum('ij,ik->kj',A,B)
...: p2 = sA*sB[:,None]
...: p3 = N*((B**2).sum(0)) - (sB**2)
...: p4 = N*((A**2).sum(0)) - (sA**2)
...: pcorr = ((p1 - p2)/np.sqrt(p4*p3[:,None]))
...: out = pcorr[np.nanargmax(np.abs(pcorr),axis=0),np.arange(pcorr.shape[1])]
...:
In [17]: pcorr # Pearson Correlation Coefficient array
Out[17]:
array([[ 0.41895565, -0.5910935 , -0.40465987, 0.5818286 ],
[ 0.66609445, -0.41950457, 0.02450215, 0.64028344],
[-0.64953314, 0.65669916, 0.30836196, -0.85127779],
[-0.41917583, 0.59043266, 0.40364532, -0.58144102],
[ 0.8067843 , 0.07947386, 0.74016181, 0.53165395],
[-0.1613146 , 0.95678152, 0.62107101, -0.4215393 ]])
In [18]: out # elements corresponding to absolute argmax along columns
Out[18]: array([ 0.8067843 , 0.95678152, 0.74016181, -0.85127779])
Runtime tests -
In [36]: A = np.random.rand(4000,40)
In [37]: B = np.random.rand(4000,144)
In [38]: np.allclose(org_app(A,B),proposed_app(A,B))
Out[38]: True
In [39]: %timeit org_app(A,B) # Original approach
1 loops, best of 3: 1.35 s per loop
In [40]: %timeit proposed_app(A,B) # Proposed vectorized approach
10 loops, best of 3: 39.1 ms per loop
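For reference, org_app and proposed_app used in the timings above are not defined in the post; they could look roughly like this (a sketch that simply wraps the question's double loop and this answer's vectorized code, reusing the question's max_absolute helper):

import numpy as np
from scipy.stats import pearsonr

def org_app(A, B):
    # question's double loop over all column pairs
    out = np.zeros(A.shape[1])
    for A_col in range(A.shape[1]):
        high_corr = 0
        for B_col in range(B.shape[1]):
            corr, _ = pearsonr(A[:, A_col], B[:, B_col])
            high_corr = max_absolute(high_corr, corr)
        out[A_col] = high_corr
    return out

def proposed_app(A, B):
    # vectorized Pearson correlation of all column pairs at once
    N = B.shape[0]
    sA = A.sum(0)
    sB = B.sum(0)
    p1 = N * np.einsum('ij,ik->kj', A, B)
    p2 = sA * sB[:, None]
    p3 = N * ((B**2).sum(0)) - (sB**2)
    p4 = N * ((A**2).sum(0)) - (sA**2)
    pcorr = (p1 - p2) / np.sqrt(p4 * p3[:, None])
    return pcorr[np.nanargmax(np.abs(pcorr), axis=0),
                 np.arange(pcorr.shape[1])]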
Adding on to the above answer from personal experience:

p1 = N*np.dot(B.T,A)

worked much faster for me compared to

p1 = N*np.einsum('ij,ik->kj',A,B)

This was especially true when A and B are large matrices.
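A quick check (a minimal sketch with random data) confirms the two expressions are interchangeable; np.dot typically wins because it dispatches to an optimized BLAS matrix multiply:

import numpy as np

A = np.random.rand(4000, 40)
B = np.random.rand(4000, 144)
N = B.shape[0]

p1_einsum = N * np.einsum('ij,ik->kj', A, B)  # shape (B_cols, A_cols)
p1_dot = N * np.dot(B.T, A)                   # same shape, same values

print(np.allclose(p1_einsum, p1_dot))         # True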