Computing the correlation coefficient between two multi-dimensional arrays

I have two arrays that have the shapes N X T and M X T. I'd like to compute the correlation coefficient across T between every possible pair of rows n and m (from N and M, respectively).

What's the fastest, most pythonic way to do this? (Looping over N and M would seem to me to be neither fast nor pythonic.) I'm expecting the answer to involve numpy and/or scipy. Right now my arrays are numpy arrays, but I'm open to converting them to a different type.

I'm expecting my output to be an array with the shape N X M.

N.B. When I say "correlation coefficient," I mean the Pearson product-moment correlation coefficient.

Here are some things to note:

  • The numpy function correlate requires input arrays to be one-dimensional.
  • The numpy function corrcoef accepts two-dimensional arrays, but they must have the same shape.
  • The scipy.stats function pearsonr requires input arrays to be one-dimensional.
dbliss asked May 09 '15 18:05


1 Answer

Correlation (default 'valid' case) between two 2D arrays:

You can simply use matrix multiplication with np.dot:

out = np.dot(arr_one,arr_two.T) 

With the default "valid" mode, the correlation of each pair of rows (row1, row2) from the two input arrays is the value at position (row1, row2) of this product. Note that this is the raw sum of products computed by np.correlate, not the Pearson coefficient.
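To make the claim above concrete, here is a minimal sketch that compares the np.dot result with np.correlate called in "valid" mode on every pair of rows. The array sizes and the random seed are arbitrary choices for illustration, not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for the example
arr_one = rng.random((3, 5))     # N x T
arr_two = rng.random((4, 5))     # M x T

# Matrix product: entry (i, j) is the sum of products of row i and row j
out = np.dot(arr_one, arr_two.T)

# Reference: loop over all row pairs with np.correlate in "valid" mode.
# For equal-length inputs, "valid" mode returns a single value.
ref = np.array([[np.correlate(a, b, mode="valid")[0] for b in arr_two]
                for a in arr_one])

print(out.shape)
print(np.allclose(out, ref))
```

For real-valued rows of equal length, both computations reduce to the same dot product, so the two results should agree up to floating-point error.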


Row-wise Correlation Coefficient calculation for two 2D arrays:

import numpy as np

def corr2_coeff(A, B):
    # Row-wise mean of each input array, subtracted from the array itself
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]

    # Sum of squared deviations within each row
    ssA = (A_mA**2).sum(1)
    ssB = (B_mB**2).sum(1)

    # Finally get corr coeff
    return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None], ssB[None]))

This is based on this solution to "How to apply corr2 functions in Multidimentional arrays in MATLAB".
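As a sanity check, the function can be compared against np.corrcoef. When given two 2D arrays, np.corrcoef stacks their rows and returns an (N+M) x (N+M) matrix, whose top-right N x M block holds the cross-correlations between rows of A and rows of B. The sketch below repeats the function so that it runs on its own; the sizes and seed are arbitrary.

```python
import numpy as np

def corr2_coeff(A, B):
    # Same computation as in the answer: centre each row, then normalise
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]
    ssA = (A_mA**2).sum(1)
    ssB = (B_mB**2).sum(1)
    return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None], ssB[None]))

rng = np.random.default_rng(0)   # arbitrary seed for reproducibility
N, M, T = 4, 6, 50
A = rng.random((N, T))
B = rng.random((M, T))

fast = corr2_coeff(A, B)         # N x M

# np.corrcoef(A, B) treats every row of A and B as a variable
full = np.corrcoef(A, B)         # (N + M) x (N + M)
cross = full[:N, N:]             # rows of A against rows of B

print(fast.shape)
print(np.allclose(fast, cross))
```

The np.corrcoef route also computes the N x N and M x M blocks, which are discarded here, so it does more work than needed when only the cross block is wanted.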

Benchmarking

This section compares the runtime of the proposed approach against generate_correlation_map and the loopy pearsonr-based approach listed in the other answer (taken from the function test_generate_correlation_map(), without the value-correctness verification code at its end). Note that the timings for the proposed approach also include a check at the start for an equal number of columns in the two input arrays, as is also done in that other answer. The runtimes follow in the cases below.
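The exact form of the column check used for the timings is not shown in the answer. One way it could look is sketched below; the wrapper name corr2_coeff_checked and the choice of ValueError are assumptions made for illustration.

```python
import numpy as np

def corr2_coeff_checked(A, B):
    # Hypothetical wrapper: reject inputs whose column counts differ,
    # since the correlation is taken across the shared T axis.
    if A.shape[1] != B.shape[1]:
        raise ValueError("A and B must have the same number of columns")

    # Same computation as corr2_coeff in the answer
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]
    ssA = (A_mA**2).sum(1)
    ssB = (B_mB**2).sum(1)
    return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None], ssB[None]))
```

The check is a single shape comparison, so it should add negligible time to the measurements.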

Case #1:

In [106]: A = np.random.rand(1000, 100)

In [107]: B = np.random.rand(1000, 100)

In [108]: %timeit corr2_coeff(A, B)
100 loops, best of 3: 15 ms per loop

In [109]: %timeit generate_correlation_map(A, B)
100 loops, best of 3: 19.6 ms per loop

Case #2:

In [110]: A = np.random.rand(5000, 100)

In [111]: B = np.random.rand(5000, 100)

In [112]: %timeit corr2_coeff(A, B)
1 loops, best of 3: 368 ms per loop

In [113]: %timeit generate_correlation_map(A, B)
1 loops, best of 3: 493 ms per loop

Case #3:

In [114]: A = np.random.rand(10000, 10)

In [115]: B = np.random.rand(10000, 10)

In [116]: %timeit corr2_coeff(A, B)
1 loops, best of 3: 1.29 s per loop

In [117]: %timeit generate_correlation_map(A, B)
1 loops, best of 3: 1.83 s per loop

The loopy pearsonr-based approach seemed too slow, but here are the runtimes for one small data size:

In [118]: A = np.random.rand(1000, 100)

In [119]: B = np.random.rand(1000, 100)

In [120]: %timeit corr2_coeff(A, B)
100 loops, best of 3: 15.3 ms per loop

In [121]: %timeit generate_correlation_map(A, B)
100 loops, best of 3: 19.7 ms per loop

In [122]: %timeit pearsonr_based(A, B)
1 loops, best of 3: 33 s per loop
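The code of pearsonr_based is not reproduced here. For readers wondering what a loop-based baseline of that kind might look like, the following is a rough sketch under that assumption. The function name pearsonr_loop is hypothetical and this is not the other answer's code; it calls scipy.stats.pearsonr once per pair of rows.

```python
import numpy as np
from scipy.stats import pearsonr

def pearsonr_loop(A, B):
    # Hypothetical baseline: one pearsonr call per (row of A, row of B) pair
    n, m = A.shape[0], B.shape[0]
    out = np.empty((n, m))
    for i in range(n):
        for j in range(m):
            r, _ = pearsonr(A[i], B[j])   # discard the p-value
            out[i, j] = r
    return out

rng = np.random.default_rng(0)   # arbitrary seed and small sizes
A = rng.random((5, 30))
B = rng.random((7, 30))

looped = pearsonr_loop(A, B)
# Cross block of np.corrcoef as an independent reference
reference = np.corrcoef(A, B)[:5, 5:]
print(np.allclose(looped, reference))
```

A loop like this makes N x M separate Python-level calls, each of which also computes a p-value, which is consistent with the large gap in the timings above.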
Divakar answered Oct 11 '22 17:10