Calculate correlation between all columns of a DataFrame and all columns of another DataFrame?

I have a DataFrame, stocks, filled with stock returns, and another DataFrame, industries, filled with industry returns. I want to find each stock's correlation with each industry. In the example below, df1 stands in for stocks and df2 for industries.

import numpy as np
import pandas as pd

np.random.seed(123)

df1 = pd.DataFrame({'s1': np.random.randn(10000), 's2': np.random.randn(10000)})
df2 = pd.DataFrame({'i1': np.random.randn(10000), 'i2': np.random.randn(10000)})

The expensive way to do this is to merge the two DataFrame objects, calculate the full correlation matrix, and then throw out all the stock-to-stock and industry-to-industry correlations. Is there a more efficient way to do this?
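
For reference, the merge-then-discard approach described above might look like the following sketch. It assumes both frames share the same row index; the names `combined`, `full`, and `cross` are illustrative and not from the original post.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df1 = pd.DataFrame({'s1': np.random.randn(10000), 's2': np.random.randn(10000)})
df2 = pd.DataFrame({'i1': np.random.randn(10000), 'i2': np.random.randn(10000)})

# Put stocks and industries side by side (rows are aligned on the index).
combined = pd.concat([df1, df2], axis=1)

# Full correlation matrix: includes stock-stock and industry-industry blocks.
full = combined.corr()

# Keep only the stock (rows) vs. industry (columns) block.
cross = full.loc[df1.columns, df2.columns]
print(cross)
```

The wasted work is everything in `full` outside the `cross` block, which grows quickly as the number of columns increases.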

asked Mar 08 '15 21:03

Deets McGeets

3 Answers

Here's a one-liner that uses apply on the columns and avoids the nested for loops used in the other answers. The main benefit is that apply assembles the result into a DataFrame.

df1.apply(lambda s: df2.corrwith(s))
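
As a quick check of what the one-liner returns, here is a self-contained sketch using the question's sample data. Note the orientation: the columns of df2 become the row labels and the columns of df1 become the column labels. The name `result` is illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df1 = pd.DataFrame({'s1': np.random.randn(10000), 's2': np.random.randn(10000)})
df2 = pd.DataFrame({'i1': np.random.randn(10000), 'i2': np.random.randn(10000)})

# For each column s of df1, correlate every column of df2 with s.
result = df1.apply(lambda s: df2.corrwith(s))

print(result.shape)           # one row per df2 column, one column per df1 column
print(result.loc['i1', 's1']) # correlation between stock s1 and industry i1

# Transpose if stocks should be the rows instead.
by_stock = result.T
```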

answered Oct 07 '22 12:10

ytsaig

Here's a slightly simpler answer than @JohnE's that uses pandas natively instead of numpy.corrcoef. As an added bonus, you don't have to pull the correlation value out of a silly 2x2 correlation matrix, because pandas's series-to-series correlation method, Series.corr, simply returns a number, not a matrix.

for s in ['s1','s2']:
    for i in ['i1','i2']:
        print(df1[s].corr(df2[i]))

answered Oct 07 '22 14:10

failwhale

Edit to add: I'll leave this answer for posterity but would recommend the later answers. In particular, use @ytsaig's if you want the simplest answer, but use @failwhale's if you want a faster one (in some quick timings on the data in the OP it was about 5x faster than @ytsaig's and about the same speed as mine).
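
To reproduce that kind of comparison yourself, a rough timing harness could look like the sketch below. The function names, and the choice of 20 runs repeated 3 times, are assumptions made for illustration. Results will vary with machine and pandas version, so no timings are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
df1 = pd.DataFrame({'s1': np.random.randn(10000), 's2': np.random.randn(10000)})
df2 = pd.DataFrame({'i1': np.random.randn(10000), 'i2': np.random.randn(10000)})

def apply_corrwith():
    # @ytsaig's approach
    return df1.apply(lambda s: df2.corrwith(s))

def series_corr_loop():
    # @failwhale's approach, collecting values instead of printing them
    return [df1[s].corr(df2[i]) for s in df1.columns for i in df2.columns]

def corrcoef_loop():
    # The numpy.corrcoef approach from this answer
    return [np.corrcoef(df1[s], df2[i])[0, 1]
            for s in df1.columns for i in df2.columns]

runs = 20
timings = {}
for fn in (apply_corrwith, series_corr_loop, corrcoef_loop):
    # Best of 3 repeats, reported as seconds per call.
    timings[fn.__name__] = min(timeit.repeat(fn, number=runs, repeat=3)) / runs
    print(fn.__name__, timings[fn.__name__])
```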

Original answer: You could go with numpy.corrcoef() which is basically the same as corr in pandas, but the syntax may be more amenable to what you want.

for s in ['s1','s2']:
    for i in ['i1','i2']:
        print( 'corrcoef',s,i,np.corrcoef(df1[s],df2[i])[0,1] )

That prints:

corrcoef s1 i1 -0.00416977553597
corrcoef s1 i2 -0.0096393047035
corrcoef s2 i1 -0.026278689352
corrcoef s2 i2 -0.00402030582064

Alternatively you could load the results into a dataframe with appropriate labels:

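# DataFrame.append was removed in pandas 2.0, so collect the values first
# and build the DataFrame once at the end.
rows = {}
for s in ['s1','s2']:
    for i in ['i1','i2']:
        rows[s+'_'+i] = np.corrcoef(df1[s],df2[i])[0,1]
cc = pd.Series(rows, name='corrcoef').to_frame()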

Which looks like this:

       corrcoef
s1_i1 -0.004170
s1_i2 -0.009639
s2_i1 -0.026279
s2_i2 -0.004020
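
If the Python-level loop becomes a bottleneck with many columns, one further option, which is not taken from any of the answers here, is to compute only the cross block with a single matrix product. The sketch below assumes the two frames have the same number of rows in the same order and contain no NaNs; `cross_corr` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df1 = pd.DataFrame({'s1': np.random.randn(10000), 's2': np.random.randn(10000)})
df2 = pd.DataFrame({'i1': np.random.randn(10000), 'i2': np.random.randn(10000)})

def cross_corr(left, right):
    """Pearson correlation of every column of `left` with every column of `right`.

    Hypothetical helper: assumes equal row counts, aligned rows, and no NaNs.
    """
    a = left.to_numpy(dtype=float)
    b = right.to_numpy(dtype=float)
    # Center each column, then scale it to unit Euclidean length.
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    a /= np.linalg.norm(a, axis=0)
    b /= np.linalg.norm(b, axis=0)
    # Dot products of unit-length centered columns are Pearson correlations.
    return pd.DataFrame(a.T @ b, index=left.columns, columns=right.columns)

cc_matrix = cross_corr(df1, df2)
print(cc_matrix)
```

Unlike the pandas methods, this does no index alignment or missing-value handling, so it is only a drop-in replacement when those assumptions hold.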

answered Oct 07 '22 13:10

JohnE

Tags: python, python-3.x, pandas
