 

pandas correlation matrix between each pair groupby item

I have a csv file like this:

date,sym,close
2014.01.01,A,10
2014.01.02,A,11
2014.01.03,A,12
2014.01.04,A,13
2014.01.01,B,20
2014.01.02,B,22
2014.01.03,B,23
2014.01.01,C,33
2014.01.02,C,32
2014.01.03,C,31

Then I load it into a DataFrame named df with read_csv and group it by sym:

import numpy as np
import pandas as pd
df=pd.read_csv('daily.csv',index_col=[0])
groups=df.groupby('sym')[['close']].apply(lambda x:func(x['close'].values))

The groups look like this:

sym
A    [nan,1.00,2.00,...]
B    [nan,1.00,2.00,...]
C    [nan,1.00,2.00,...]

How can I calculate the correlation between each pair of sym values?

AA,AB,AC,BB,BA,BC,CA,CB,CC

Note that the number of rows per sym may NOT be the same.

asked Apr 14 '15 15:04 by seizetheday




1 Answer

With df as above, make a pivot table:

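# df was read with index_col=[0], so move 'date' back into a column first;
# recent pandas versions also require keyword arguments for pivot
dfp = df.reset_index().pivot(index='date', columns='sym')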
print(dfp)
           close        
sym            A   B   C
date                    
2014-01-01    10  20  33
2014-01-02    11  22  32
2014-01-03    12  23  31
2014-01-04    13 NaN  30

pandas will calculate the pairwise coefficients:

print(dfp.corr())
              close                    
sym               A         B         C
      sym                              
close A    1.000000  0.981981 -1.000000
      B    0.981981  1.000000 -0.981981
      C   -1.000000 -0.981981  1.000000
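For anyone who wants to try this end to end, here is a minimal self-contained sketch built on the sample rows from the question. The inline `csv_text` string, the names `wide` and `pairs`, and the explicit date format are my own assumptions for illustration; they are not part of the original post. It also flattens the matrix into one entry per pair (AA, AB, AC, ...), which is the shape the question asks for.

```python
import io

import pandas as pd

# Sample rows copied from the question; B and C have no 2014.01.04 row there.
csv_text = """date,sym,close
2014.01.01,A,10
2014.01.02,A,11
2014.01.03,A,12
2014.01.04,A,13
2014.01.01,B,20
2014.01.02,B,22
2014.01.03,B,23
2014.01.01,C,33
2014.01.02,C,32
2014.01.03,C,31
"""

df = pd.read_csv(io.StringIO(csv_text))
df["date"] = pd.to_datetime(df["date"], format="%Y.%m.%d")

# One column per sym, aligned on date; dates a sym lacks become NaN.
wide = df.pivot(index="date", columns="sym", values="close")

# Pairwise Pearson correlation; each pair uses only the dates both syms share.
corr = wide.corr()

# Flatten the matrix into one entry per ordered pair: (A, A), (A, B), ...
pairs = {(a, b): corr.loc[a, b] for a in corr.index for b in corr.columns}

print(corr)
for (a, b), value in pairs.items():
    print(a + b, round(value, 6))
```

Because the columns are aligned on date, groups of different lengths are not a problem: the missing rows are simply skipped for the pairs that involve them. If you need a minimum amount of overlap before trusting a coefficient, `corr` also accepts a `min_periods` argument.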

But if you want to prettify it, check out seaborn:

import seaborn as sns
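# sns.corrplot has been removed from seaborn; a heatmap of the correlation matrix is the replacement
sns.heatmap(dfp.corr(), annot=True)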

result:

[image: annotated correlation plot for A, B and C]

answered Oct 01 '22 02:10 by cphlewis