
How to calculate p-values for pairwise correlation of columns in Pandas?

Pandas has a very handy method for pairwise correlation of columns, DataFrame.corr(). It compares every pair of columns in a DataFrame, however many columns there are. For instance:

df = pd.DataFrame(np.random.randint(0,100,size=(100, 10)))

     0   1   2   3   4   5   6   7   8   9
0    9  17  55  32   7  97  61  47  48  46
1    8  83  87  56  17  96  81   8  87   0
2   60  29   8  68  56  63  81   5  24  52
3   42  76   6  75   7  59  19  17   3  63
...

Now it is possible to test correlation between all 10 columns with df.corr(method='pearson'):

      0         1         2         3         4         5         6         7         8         9
0  1.000000  0.082789 -0.094096 -0.086091  0.163091  0.013210  0.167204 -0.002514  0.097481  0.091020
1  0.082789  1.000000  0.027158 -0.080073  0.056364 -0.050978 -0.018428 -0.014099 -0.135125 -0.043797
2 -0.094096  0.027158  1.000000 -0.102975  0.101597 -0.036270  0.202929  0.085181  0.093723 -0.055824
3 -0.086091 -0.080073 -0.102975  1.000000 -0.149465  0.033130 -0.020929  0.183301 -0.003853 -0.062889
4  0.163091  0.056364  0.101597 -0.149465  1.000000 -0.007567 -0.017212 -0.086300  0.177247 -0.008612
5  0.013210 -0.050978 -0.036270  0.033130 -0.007567  1.000000 -0.080148 -0.080915 -0.004612  0.243713
6  0.167204 -0.018428  0.202929 -0.020929 -0.017212 -0.080148  1.000000  0.135348  0.070330  0.008170
7 -0.002514 -0.014099  0.085181  0.183301 -0.086300 -0.080915  0.135348  1.000000 -0.114413 -0.111642
8  0.097481 -0.135125  0.093723 -0.003853  0.177247 -0.004612  0.070330 -0.114413  1.000000 -0.153564
9  0.091020 -0.043797 -0.055824 -0.062889 -0.008612  0.243713  0.008170 -0.111642 -0.153564  1.000000

Is there a simple way to also get the corresponding p-values (ideally in pandas), as it is returned e.g. by scipy's kendalltau()?

n1000 asked Oct 10 '18 13:10


3 Answers

Why not use the "method" argument of pandas.DataFrame.corr()? It accepts:

  • pearson : standard correlation coefficient.
  • kendall : Kendall Tau correlation coefficient.
  • spearman : Spearman rank correlation.
  • callable: callable with input two 1d ndarrays and returning a float.

Since a callable is accepted, wrap the SciPy functions so that each returns only the p-value:

from scipy.stats import kendalltau, pearsonr, spearmanr

def kendall_pval(x, y):
    return kendalltau(x, y)[1]

def pearsonr_pval(x, y):
    return pearsonr(x, y)[1]

def spearmanr_pval(x, y):
    return spearmanr(x, y)[1]

and then

corr = df.corr(method=pearsonr_pval)
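
A minimal end-to-end sketch of this approach, assuming a small random frame like the one in the question (the seed, the 4-column shape, and the names `coefs` and `pvals` are illustrative, not from the original). Per the pandas documentation, the matrix returned for a callable has 1 along the diagonal regardless of what the callable returns, so the diagonal entries here are placeholders rather than real p-values.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def pearsonr_pval(x, y):
    # pearsonr returns (coefficient, p-value); keep only the p-value
    return pearsonr(x, y)[1]

# Illustrative data: 100 rows, 4 integer columns (seeded for repeatability)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(100, 4)))

coefs = df.corr(method="pearson")      # correlation coefficients
pvals = df.corr(method=pearsonr_pval)  # p-values, same shape and labels

# pandas fills the diagonal with 1.0 for callables; treat it as a
# placeholder, not as the p-value of a column against itself
print(pvals.round(3))
```

Calling `df.corr` twice keeps the two matrices aligned on the same labels, at the cost of computing each pairwise statistic twice.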

Ramon Dalmau answered Nov 05 '22 15:11


Probably just loop. It's basically what pandas does in the source code to generate the correlation matrix anyway:

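import pandas as pd
import numpy as np
from scipy import stats

# Pre-allocate float matrices labelled by the column names
df_corr = pd.DataFrame(index=df.columns, columns=df.columns, dtype=float)  # Correlation matrix
df_p = pd.DataFrame(index=df.columns, columns=df.columns, dtype=float)  # Matrix of p-values
for x in df.columns:
    for y in df.columns:
        r, p = stats.pearsonr(df[x], df[y])
        df_corr.loc[x, y] = r
        df_p.loc[x, y] = p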

Since the result is symmetric, you only need to compute roughly half of the pairs. To take advantage of that:

mat = df.values.T
K = len(df.columns)
correl = np.empty((K, K), dtype=float)
p_vals = np.empty((K, K), dtype=float)

for i, ac in enumerate(mat):
    for j, bc in enumerate(mat):
        if i > j:
            continue  # already filled in by symmetry

        corr = stats.pearsonr(ac, bc)
        #corr = stats.kendalltau(ac, bc)

        correl[i, j] = correl[j, i] = corr[0]
        p_vals[i, j] = p_vals[j, i] = corr[1]

# Keep the original column labels on both axes
df_p = pd.DataFrame(p_vals, index=df.columns, columns=df.columns)
df_corr = pd.DataFrame(correl, index=df.columns, columns=df.columns)
#pd.concat([df_corr, df_p], keys=['corr', 'p_val'])
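
For Pearson specifically, the loop can be avoided altogether. The sketch below is an alternative to the code above, not what pandas or SciPy do internally: it takes the coefficients from `df.corr()` and converts them to two-sided p-values with the t distribution on n - 2 degrees of freedom, which should agree with the p-value `pearsonr` reports. The function name `pearson_pvalues` is made up for illustration, and the code assumes there are no missing values, so that every pair uses the same n.

```python
import numpy as np
import pandas as pd
from scipy import stats

def pearson_pvalues(df):
    """Two-sided p-values for every Pearson r in df.corr().

    Assumes df has no missing values, so n is the same for all pairs.
    """
    r = df.corr(method="pearson").to_numpy()
    dof = len(df) - 2
    # t statistic for each coefficient; the diagonal (r = 1) divides by
    # zero, so silence the warnings and overwrite it afterwards
    with np.errstate(divide="ignore", invalid="ignore"):
        t = r * np.sqrt(dof / (1.0 - r**2))
    p = 2.0 * stats.t.sf(np.abs(t), dof)
    np.fill_diagonal(p, 0.0)  # a column is perfectly correlated with itself
    return pd.DataFrame(p, index=df.columns, columns=df.columns)

# Illustrative data (seeded for repeatability)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 4)))
df_p = pearson_pvalues(df)
print(df_p.round(3))
```

This only applies to Pearson; for Kendall or Spearman the loop or the callable approach is still needed.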

ALollz answered Nov 05 '22 15:11


To get one row per pair of columns, with its coefficient and p-value:

from itertools import combinations

import pandas as pd
from scipy.stats import pearsonr

# One row per unordered pair of columns
rows = []
for a, b in combinations(df.columns, 2):
    r, p = pearsonr(df[a], df[b])
    rows.append((a, b, r, p))

df_result = pd.DataFrame(rows, columns=['Column_a', 'Column_b', 'Correlation_coefficient', 'P-value'])

Rahul Agarwal answered Nov 05 '22 15:11