
How to calculate correlation between rows in a Python pandas data frame

I have a large data frame, and I need to efficiently calculate the correlation between each of its rows and a given list of values. For example:

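import numpy as np
from pandas import DataFrame

dfa = DataFrame(np.zeros((1, 4)), columns=['a', 'b', 'c', 'd'])
dfa.loc[0] = [2, 6, 8, 12]  # .ix has been removed from pandas; .loc does the same here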
a   b   c   d
2.0 6.0 8.0 12.0
dfb = DataFrame([[2, 6, 8, 12], [1, 3, 4, 6], [-1, -3, -4, -6]], columns=['a', 'b', 'c', 'd'])
    a   b   c   d
0   2   6   8   12
1   1   3   4   6
2  -1  -3  -4  -6

I expect to get:

0    1
1    0.5
2   -0.5

I tried many versions, for example:

dfb.T.corrwith(dfa.T, axis=0)

But all I get is a lot of NaNs.

Naomi Fridman asked Nov 02 '17 12:11



2 Answers

First of all, note that the last two correlations are 1 and -1, not 0.5 and -0.5 as you expected.
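The reason is that Pearson correlation ignores scale: rows 1 and 2 of dfb are 0.5 and -0.5 times row 0, and multiplying by a constant only changes the sign of the coefficient, never its magnitude. A quick check with NumPy's np.corrcoef (the variable names here are just for illustration):

```python
import numpy as np

row0 = np.array([2, 6, 8, 12], dtype=float)

# Scaling by a positive factor leaves the Pearson correlation at 1;
# scaling by a negative factor flips it to -1.
r_half = np.corrcoef(row0, 0.5 * row0)[0, 1]
r_neg_half = np.corrcoef(row0, -0.5 * row0)[0, 1]

print(r_half, r_neg_half)
```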

Solution

dfb.corrwith(dfa.iloc[0], axis=1)

Results

0    1.0
1    1.0
2   -1.0
dtype: float64
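Since the question mentions a large data frame, a vectorized NumPy version may be worth trying as well. The sketch below is only one possible approach: rowwise_corr is a hypothetical helper name, and it assumes the frame is fully numeric with no NaNs and no constant rows (a constant row would give a zero denominator). It centers every row and the reference list, then computes all the coefficients in one matrix product, and compares the result against corrwith.

```python
import numpy as np
import pandas as pd


def rowwise_corr(df, ref):
    """Pearson correlation of every row of df with the 1-D sequence ref.

    Hypothetical helper; assumes no NaNs and no constant rows.
    """
    x = df.to_numpy(dtype=float)
    r = np.asarray(ref, dtype=float)
    xc = x - x.mean(axis=1, keepdims=True)  # center each row
    rc = r - r.mean()                       # center the reference values
    num = xc @ rc                           # one covariance-like sum per row
    den = np.sqrt((xc ** 2).sum(axis=1) * (rc ** 2).sum())
    return pd.Series(num / den, index=df.index)


dfb = pd.DataFrame(
    [[2, 6, 8, 12], [1, 3, 4, 6], [-1, -3, -4, -6]],
    columns=['a', 'b', 'c', 'd'],
)
ref = pd.Series([2, 6, 8, 12], index=dfb.columns, dtype=float)

fast = rowwise_corr(dfb, ref)
slow = dfb.corrwith(ref, axis=1)  # same numbers via pandas
print(fast)
```

Unlike corrwith, this version matches values by position rather than by column label, so ref has to be in the same order as the columns of the frame.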
seralouk answered Nov 14 '22 23:11



I think the number you are trying to get is not actually a correlation coefficient. The correlation between the first and second rows is 1, not 0.5. Correlation measures the strength of the linear relationship between variables. Here the two rows are perfectly correlated, with a Pearson coefficient of 1: if you plot row 0 [2,6,8,12] against row 1 [1,3,4,6], the points all lie on a single line. Meanwhile, if you want the correlation between rows, this should work:

NOTE: the correct correlation is [1,1,-1]

dfb.transpose().corr()
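As a self-contained illustration of the line above (the names row_corr_matrix and vs_row0 are only for this example): transposing turns each row into a column, so corr() returns a 3x3 matrix of row-against-row correlations. Because the reference list in the question happens to equal row 0, the column labelled 0 holds the values being asked for.

```python
import pandas as pd

dfb = pd.DataFrame(
    [[2, 6, 8, 12], [1, 3, 4, 6], [-1, -3, -4, -6]],
    columns=['a', 'b', 'c', 'd'],
)

# Each original row becomes a column, so corr() compares rows pairwise.
row_corr_matrix = dfb.transpose().corr()

# Column 0 is every row's correlation with row 0, i.e. with [2, 6, 8, 12].
vs_row0 = row_corr_matrix[0]
print(vs_row0)
```

Note that this computes every pair of rows, so the result has one row and one column per row of dfb. For a large frame where only one reference list matters, that is likely far more work than needed.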

Yogesh answered Nov 14 '22 22:11
