I have a large data frame, and I need to efficiently calculate the correlation between each of its rows and a given list of values. For example:
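import numpy as np
from pandas import DataFrame
dfa = DataFrame(np.zeros((1,4)), columns=['a','b','c','d'])
dfa.iloc[0] = [2,6,8,12]  # .ix has been removed from pandas; .iloc selects by position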
     a    b    c     d
0  2.0  6.0  8.0  12.0
dfb= DataFrame([[2,6,8,12],[1,3,4,6],[-1,-3,-4,-6]], columns=['a','b','c','d'])
   a  b  c   d
0  2  6  8  12
1  1  3  4   6
2 -1 -3 -4  -6
I expect to get:
0 1
1 0.5
2 -0.5
I tried many versions, for example:
dfb.T.corrwith(dfa.T, axis=0)
But all I get is a lot of NaNs.
First of all, note that the last 2 correlations are 1 and -1 and not 0.5 and -0.5 as you expected.
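To see why the values are 1 and -1 rather than 0.5 and -0.5, it may help to compute Pearson's coefficient by hand. The sketch below uses plain NumPy; the helper name `pearson` is made up for illustration. It also fits a straight line through the points, on the assumption (not confirmed by the question) that the expected 0.5 was really the scale factor between the rows, which is a slope rather than a correlation.

```python
import numpy as np

x = np.array([2.0, 6.0, 8.0, 12.0])  # row 0 of dfb (same values as dfa)
y = np.array([1.0, 3.0, 4.0, 6.0])   # row 1 of dfb, exactly 0.5 * x


def pearson(u, v):
    """Pearson correlation of two 1-D arrays (hypothetical helper)."""
    du = u - u.mean()
    dv = v - v.mean()
    return (du * dv).sum() / np.sqrt((du ** 2).sum() * (dv ** 2).sum())


r_pos = pearson(x, y)    # rows that differ only by a positive scale factor
r_neg = pearson(x, -y)   # row 2 of dfb is -y, so the sign flips

# Assumption: the 0.5 in the question is the slope of y against x.
slope, intercept = np.polyfit(x, y, 1)

print(r_pos, r_neg, slope)
```

Rescaling a row changes the slope of the fitted line but not the correlation, because the scale factor cancels between the numerator and the denominator.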
Solution
dfb.corrwith(dfa.iloc[0], axis=1)
Results
0 1.0
1 1.0
2 -1.0
dtype: float64
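For anyone who wants to reproduce this, here is a self-contained sketch of the call above next to the attempt from the question. As far as I can tell, the NaNs appear because `corrwith` aligns two DataFrames on their labels: after transposing, `dfa.T` has only column 0, so columns 1 and 2 of `dfb.T` have nothing to pair with. Passing a Series instead correlates it with every row. The variable names `naive` and `row_corr` are mine, not from the question.

```python
import numpy as np
import pandas as pd

cols = ['a', 'b', 'c', 'd']
dfa = pd.DataFrame([[2.0, 6.0, 8.0, 12.0]], columns=cols)
dfb = pd.DataFrame([[2, 6, 8, 12], [1, 3, 4, 6], [-1, -3, -4, -6]],
                   columns=cols)

# Attempt from the question: DataFrame vs DataFrame, aligned on labels.
# Only column 0 exists on both sides, so columns 1 and 2 come back as NaN.
naive = dfb.T.corrwith(dfa.T, axis=0)

# Suggested fix: pass the reference row as a Series and correlate along rows.
row_corr = dfb.corrwith(dfa.iloc[0], axis=1)

# Independent cross-check with NumPy, one row at a time.
reference = dfa.iloc[0].to_numpy()
check = [np.corrcoef(dfb.iloc[i].to_numpy(dtype=float), reference)[0, 1]
         for i in range(len(dfb))]

print(naive)
print(row_corr)
```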
I think the number you are trying to get is not actually a correlation coefficient. The correlation between the first and second rows is 1, not 0.5. Correlation measures the strength of the linear relationship between variables, and here the two lists are perfectly correlated, with a Pearson coefficient of 1. If you plot row 0 [2,6,8,12] against row 1 [1,3,4,6], the points all lie on a single line. Meanwhile, if you want to find the correlation between rows, this should work:
NOTE: the correct correlation is [1,1,-1]
dfb.transpose().corr()
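Note that the transposed `corr()` call above computes every row-against-row pair, so its result is an n-by-n matrix; the answer to the question is the column for row 0 (which happens to equal the reference list here). For a large frame that may be more work than needed. Below is a rough sketch of a vectorized alternative in NumPy that compares each row with a single vector. The function name `rowwise_corr` is hypothetical, and the sketch assumes there are no missing values and no constant rows (a constant row would divide by zero).

```python
import numpy as np
import pandas as pd

dfb = pd.DataFrame([[2, 6, 8, 12], [1, 3, 4, 6], [-1, -3, -4, -6]],
                   columns=['a', 'b', 'c', 'd'])
target = np.array([2.0, 6.0, 8.0, 12.0])  # the given list of values

# All-pairs approach: take the column of the matrix that belongs to row 0.
full = dfb.transpose().corr()
first_col = full[0]


def rowwise_corr(frame, vec):
    """Pearson correlation of each row of `frame` with `vec` (hypothetical helper).

    Assumes no NaNs and no constant rows.
    """
    a = frame.to_numpy(dtype=float)
    a = a - a.mean(axis=1, keepdims=True)  # center each row
    v = vec - vec.mean()                   # center the reference vector
    num = a @ v                            # one dot product per row
    den = np.sqrt((a ** 2).sum(axis=1) * (v ** 2).sum())
    return pd.Series(num / den, index=frame.index)


fast = rowwise_corr(dfb, target)
print(first_col)
print(fast)
```

The vectorized version does one pass over the data instead of building the full matrix, which is the main reason to prefer it when the frame has many rows.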