Calculate correlation between all columns of a DataFrame and all columns of another DataFrame?

I have a DataFrame object stocks filled with stock returns. I have another DataFrame object industries filled with industry returns. I want to find each stock's correlation with each industry.

import numpy as np
import pandas as pd

np.random.seed(123)

df1 = pd.DataFrame({'s1': np.random.randn(10000), 's2': np.random.randn(10000)})
df2 = pd.DataFrame({'i1': np.random.randn(10000), 'i2': np.random.randn(10000)})

The expensive way to do this is to merge the two DataFrame objects, calculate correlation, and then throw out all the stock to stock and industry to industry correlations. Is there a more efficient way to do this?
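
For reference, the merge-then-discard approach described above might look roughly like the sketch below. The names combined, full and cross are illustrative only, and it assumes both frames share the same row index.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df1 = pd.DataFrame({'s1': np.random.randn(10000), 's2': np.random.randn(10000)})
df2 = pd.DataFrame({'i1': np.random.randn(10000), 'i2': np.random.randn(10000)})

# Put stocks and industries side by side (assumes a shared row index).
combined = pd.concat([df1, df2], axis=1)

# Full 4x4 correlation matrix, including the unwanted
# stock-to-stock and industry-to-industry blocks.
full = combined.corr()

# Keep only the stock rows and the industry columns.
cross = full.loc[df1.columns, df2.columns]
print(cross)
```

The wasted work is in the two diagonal blocks of full, which grow quadratically with the number of columns in each frame.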

Deets McGeets asked Mar 08 '15 21:03



3 Answers

Here's a one-liner that calls corrwith once per column of df1 via apply, which avoids the nested for loops. The main benefit is that apply assembles the result into a DataFrame for you.

df1.apply(lambda s: df2.corrwith(s))
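
A short sketch of how the result of that one-liner is laid out, using the data from the question. The names result and by_stock are illustrative only.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df1 = pd.DataFrame({'s1': np.random.randn(10000), 's2': np.random.randn(10000)})
df2 = pd.DataFrame({'i1': np.random.randn(10000), 'i2': np.random.randn(10000)})

# Each column s of df1 is passed to df2.corrwith(s), which returns one
# correlation per column of df2; apply stacks those Series as columns.
result = df1.apply(lambda s: df2.corrwith(s))

# Rows are df2's columns (industries), columns are df1's columns (stocks).
print(result)

# Transpose to get one row per stock instead.
by_stock = result.T
print(by_stock)
```
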
ytsaig answered Oct 07 '22 12:10


Here's a slightly simpler answer than @JohnE's that uses pandas natively instead of using numpy.corrcoef. As an added bonus, you don't have to retrieve the correlation value out of a silly 2x2 correlation matrix, because pandas's series-to-series correlation function simply returns a number, not a matrix.

for s in ['s1','s2']:
    for i in ['i1','i2']:
        print(df1[s].corr(df2[i]))
failwhale answered Oct 07 '22 14:10


Edit to add: I'll leave this answer for posterity but would recommend the later answers. In particular, use @ytsaig's if you want the simplest answer, but use @failwhale's if you want a faster one (it was about 5x faster than @ytsaig's in some quick timings I did using the data in the OP, and about the same speed as mine).

Original answer: You could go with numpy.corrcoef() which is basically the same as corr in pandas, but the syntax may be more amenable to what you want.

for s in ['s1','s2']:
    for i in ['i1','i2']:
        print( 'corrcoef',s,i,np.corrcoef(df1[s],df2[i])[0,1] )
   

That prints:

corrcoef s1 i1 -0.00416977553597
corrcoef s1 i2 -0.0096393047035
corrcoef s2 i1 -0.026278689352
corrcoef s2 i2 -0.00402030582064

Alternatively you could load the results into a dataframe with appropriate labels:

# DataFrame.append was removed in pandas 2.0, so collect the values first
rows = {}
for s in ['s1','s2']:
    for i in ['i1','i2']:
        rows[s+'_'+i] = np.corrcoef(df1[s],df2[i])[0,1]
cc = pd.DataFrame({ 'corrcoef': pd.Series(rows) })

Which looks like this:

       corrcoef
s1_i1 -0.004170
s1_i2 -0.009639
s2_i1 -0.026279
s2_i2 -0.004020
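
If there are many columns, the nested loops can in principle be replaced by a single matrix product. This is not one of the methods from the answers here, just a minimal sketch of a different technique: standardize every column, then take the mean of the products of the z-scores, which is the Pearson correlation. The helper name cross_corr is hypothetical, and the sketch assumes both frames have the same row index and no missing values.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df1 = pd.DataFrame({'s1': np.random.randn(10000), 's2': np.random.randn(10000)})
df2 = pd.DataFrame({'i1': np.random.randn(10000), 'i2': np.random.randn(10000)})

def cross_corr(a, b):
    """Pearson correlation of every column of a with every column of b.

    Hypothetical helper; assumes a and b share the same row index
    and contain no NaN values.
    """
    # Standardize each column using the population standard deviation.
    za = (a - a.mean()) / a.std(ddof=0)
    zb = (b - b.mean()) / b.std(ddof=0)
    # The mean of the products of z-scores is the Pearson correlation.
    return za.T.dot(zb) / len(a)

cc_matrix = cross_corr(df1, df2)  # rows: s1, s2; columns: i1, i2
print(cc_matrix)
```

Unlike corr in pandas, this does not skip missing values, so any NaN in a column propagates to every correlation involving that column.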
JohnE answered Oct 07 '22 13:10