Correlation coefficients and p values for all pairs of rows of a matrix

Tags:

I have a matrix data with m rows and n columns. I used to compute the correlation coefficients between all pairs of rows using np.corrcoef:

import numpy as np
data = np.array([[0, 1, -1], [0, -1, 1]])
np.corrcoef(data)

Now I would also like to have a look at the p-values of these coefficients. np.corrcoef doesn't provide these; scipy.stats.pearsonr does. However, scipy.stats.pearsonr does not accept a matrix on input.

Is there a quick way how to compute both the coefficient and the p-value for all pairs of rows (arriving e.g. at two m by m matrices, one with correlation coefficients, the other with corresponding p-values) without having to manually go through all pairs?

611

asked Jun 26 '14 13:06

John Manak

2 Answers

I have encountered the same problem today.

After half an hour of googling, I can't find any code in numpy/scipy library can help me do this.

So I wrote my own version of corrcoef

import numpy as np
from scipy.stats import pearsonr, betai

def corrcoef(matrix):
    r = np.corrcoef(matrix)
    rf = r[np.triu_indices(r.shape[0], 1)]
    df = matrix.shape[1] - 2
    ts = rf * rf * (df / (1 - rf * rf))
    pf = betai(0.5 * df, 0.5, df / (df + ts))
    p = np.zeros(shape=r.shape)
    p[np.triu_indices(p.shape[0], 1)] = pf
    p[np.tril_indices(p.shape[0], -1)] = p.T[np.tril_indices(p.shape[0], -1)]
    p[np.diag_indices(p.shape[0])] = np.ones(p.shape[0])
    return r, p

def corrcoef_loop(matrix):
    rows, cols = matrix.shape[0], matrix.shape[1]
    r = np.ones(shape=(rows, rows))
    p = np.ones(shape=(rows, rows))
    for i in range(rows):
        for j in range(i+1, rows):
            r_, p_ = pearsonr(matrix[i], matrix[j])
            r[i, j] = r[j, i] = r_
            p[i, j] = p[j, i] = p_
    return r, p

The first version use the result of np.corrcoef, and then calculate p-value based on triangle-upper values of corrcoef matrix.

The second loop version just iterating over rows, do pearsonr manually.

def test_corrcoef():
    a = np.array([
        [1, 2, 3, 4],
        [1, 3, 1, 4],
        [8, 3, 8, 5],
        [2, 3, 2, 1]])

    r1, p1 = corrcoef(a)
    r2, p2 = corrcoef_loop(a)

    assert np.allclose(r1, r2)
    assert np.allclose(p1, p2)

The test passed, they are the same.

def test_timing():
    import time
    a = np.random.randn(100, 2500)

    def timing(func, *args, **kwargs):
        t0 = time.time()
        loops = 10
        for _ in range(loops):
            func(*args, **kwargs)
        print('{} takes {} seconds loops={}'.format(
            func.__name__, time.time() - t0, loops))

    timing(corrcoef, a)
    timing(corrcoef_loop, a)


if __name__ == '__main__':
    test_corrcoef()
    test_timing()

The performance on my Macbook against 100x2500 matrix

corrcoef takes 0.06608104705810547 seconds loops=10

corrcoef_loop takes 7.585600137710571 seconds loops=10

187

answered Oct 20 '22 13:10

jingchao

The most consice way of doing it might be the buildin method .corr in pandas, to get r:

In [79]:

import pandas as pd
m=np.random.random((6,6))
df=pd.DataFrame(m)
print df.corr()
          0         1         2         3         4         5
0  1.000000 -0.282780  0.455210 -0.377936 -0.850840  0.190545
1 -0.282780  1.000000 -0.747979 -0.461637  0.270770  0.008815
2  0.455210 -0.747979  1.000000 -0.137078 -0.683991  0.557390
3 -0.377936 -0.461637 -0.137078  1.000000  0.511070 -0.801614
4 -0.850840  0.270770 -0.683991  0.511070  1.000000 -0.499247
5  0.190545  0.008815  0.557390 -0.801614 -0.499247  1.000000

To get p values using t-test:

In [84]:

n=6
r=df.corr()
t=r*np.sqrt((n-2)/(1-r*r))

import scipy.stats as ss
ss.t.cdf(t, n-2)
Out[84]:
array([[ 1.        ,  0.2935682 ,  0.817826  ,  0.23004382,  0.01585695,
         0.64117917],
       [ 0.2935682 ,  1.        ,  0.04363408,  0.17836685,  0.69811422,
         0.50661121],
       [ 0.817826  ,  0.04363408,  1.        ,  0.39783538,  0.06700715,
         0.8747497 ],
       [ 0.23004382,  0.17836685,  0.39783538,  1.        ,  0.84993082,
         0.02756579],
       [ 0.01585695,  0.69811422,  0.06700715,  0.84993082,  1.        ,
         0.15667393],
       [ 0.64117917,  0.50661121,  0.8747497 ,  0.02756579,  0.15667393,
         1.        ]])
In [85]:

ss.pearsonr(m[:,0], m[:,1])
Out[85]:
(-0.28277983892175751, 0.58713640696703184)
In [86]:
#be careful about the difference of 1-tail test and 2-tail test:
0.58713640696703184/2
Out[86]:
0.2935682034835159 #the value in ss.t.cdf(t, n-2) [0,1] cell

Also you can just use the scipy.stats.pearsonr you mentioned in OP:

In [95]:
#returns a list of tuples of (r, p, index1, index2)
import itertools
[ss.pearsonr(m[:,i],m[:,j])+(i, j) for i, j in itertools.product(range(n), range(n))]
Out[95]:
[(1.0, 0.0, 0, 0),
 (-0.28277983892175751, 0.58713640696703184, 0, 1),
 (0.45521036266021014, 0.36434799921123057, 0, 2),
 (-0.3779357902414715, 0.46008763115463419, 0, 3),
 (-0.85083961671703368, 0.031713908656676448, 0, 4),
 (0.19054495489542525, 0.71764166168348287, 0, 5),
 (-0.28277983892175751, 0.58713640696703184, 1, 0),
 (1.0, 0.0, 1, 1),
#etc, etc

answered Oct 20 '22 11:10

CT Zhu

Related questions
                            
                                python argparse - optional append argument with choices
                            
                                How can I include special characters (tab, newline) in a python doctest result string?
                            
                                Is there a numpy "max minus min" function?
                            
                                how enable requests async mode?
                            
                                Python multiple inheritance function overriding and ListView in django
                            
                                How to set a charset in email using smtplib in Python 2.7?
                            
                                Python modules with identical names (i.e., reusing standard module names in packages)
                            
                                Github-Flavored-Markdown in Python
                            
                                Eclipse Pydev: Run selected lines of code
                            
                                Subclassing Python's `property`
                            
                                How to write/read pandas Series to/from csv?
                            
                                ZeroMQ and multiple subscribe filters in Python
                            
                                Pip packages not found - Brewed Python
                            
                                dtypes. Difference between S1 and S2 in Python
                            
                                Python converting *args to list
                            
                                Python - is there a way to implement __getitem__ for multidimension array?
                            
                                numpy/scipy analog of matlab's fminsearch
                            
                                Can't send input to running program in Sublime Text
                            
                                OpenCV Image Resize flips dimensions?
                            
                                Enforcing Class Variables in a Subclass

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Correlation coefficients and p values for all pairs of rows of a matrix

Tags:

python

numpy

statistics

scipy

correlation

John Manak

People also ask

2 Answers

jingchao

CT Zhu

Recent Activity

Donate For Us