I have a csv file like this:
date,sym,close
2014.01.01,A,10
2014.01.02,A,11
2014.01.03,A,12
2014.01.04,A,13
2014.01.01,B,20
2014.01.02,B,22
2014.01.03,B,23
2014.01.01,C,33
2014.01.02,C,32
2014.01.03,C,31
Then I load it into a DataFrame named df via the read_csv function:
import numpy as np
import pandas as pd
df=pd.read_csv('daily.csv',index_col=[0])
groups=df.groupby('sym')[['close']].apply(lambda x:func(x['close'].values))
The groups look like this:
sym
A [nan,1.00,2.00,...]
B [nan,1.00,2.00,...]
C [nan,1.00,2.00,...]
How can I calculate the correlation between each pair of sym values?
AA,AB,AC,BB,BA,BC,CA,CB,CC
BTW, the number of items for each sym may NOT be the same.
Initialize two variables, col1 and col2, and assign them the names of the columns whose correlation you want. Compute the correlation with df[col1].corr(df[col2]), save the value in a variable, corr, and print it.
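The steps above can be sketched as follows. The column names and values here are made up for illustration (they are not the asker's df), and Series.corr uses Pearson correlation unless told otherwise.

```python
import pandas as pd

# Hypothetical data: two numeric columns of equal length.
df = pd.DataFrame({"A": [10, 11, 12, 13], "C": [33, 32, 31, 30]})

col1, col2 = "A", "C"

# Series.corr computes the Pearson correlation by default.
corr = df[col1].corr(df[col2])
print(corr)
```

Since C falls by exactly one whenever A rises by one, the two columns are perfectly anti-correlated.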
DataFrame.corr() computes the pairwise correlation of all columns in a pandas DataFrame. NaN values are excluded automatically, pair by pair. Non-numeric columns are ignored in older pandas versions; since pandas 2.0 they raise an error unless you pass numeric_only=True.
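A small sketch of that behaviour, using an invented frame with one missing value and one text column. To stay independent of the pandas version, the numeric columns are selected explicitly with select_dtypes rather than relying on the default handling of non-numeric data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: B has a missing value, label is non-numeric.
frame = pd.DataFrame({
    "A": [10.0, 11.0, 12.0, 13.0],
    "B": [20.0, 22.0, 23.0, np.nan],
    "label": ["w", "x", "y", "z"],
})

# Keep only numeric columns so corr() behaves the same across versions.
numeric = frame.select_dtypes(include="number")

# For each pair of columns, rows where either value is NaN are skipped.
matrix = numeric.corr()
print(matrix)
```

The A/B coefficient is computed from the first three rows only, because the fourth row has no value for B.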
Using pairwise correlation for feature selection means identifying groups of highly correlated features and keeping only one from each group, so that your model gets as much predictive power as possible from as few features as possible.
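One common way to do this, sketched here under assumptions: the feature names, the data, and the 0.95 threshold are all hypothetical, and dropping the later column of each correlated pair is just one possible convention.

```python
import numpy as np
import pandas as pd

# Hypothetical features: x2 is nearly 2 * x1, x3 is unrelated.
features = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.1, 3.9, 6.2, 8.0, 9.9],
    "x3": [5.0, 1.0, 4.0, 2.0, 3.0],
})

threshold = 0.95  # assumed cut-off for "highly correlated"

abs_corr = features.corr().abs()

# Keep only the strict upper triangle so each pair is inspected once.
mask = np.triu(np.ones(abs_corr.shape, dtype=bool), k=1)
upper = abs_corr.where(mask)

# Drop a column if it is highly correlated with any earlier column.
to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
reduced = features.drop(columns=to_drop)
print(to_drop)
```

Here x2 is dropped because it duplicates the information in x1, while x3 is kept.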
With df as above, make a pivot table:
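# 'date' is the index after read_csv(index_col=[0]), so restore it as a column first
dfp = df.reset_index().pivot(index='date', columns='sym')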
print(dfp)
           close
sym            A    B   C
date
2014-01-01    10   20  33
2014-01-02    11   22  32
2014-01-03    12   23  31
2014-01-04    13  NaN  30
pandas will calculate the pairwise coefficients:
print(dfp.corr())
             close
sym              A         B         C
      sym
close A   1.000000  0.981981 -1.000000
      B   0.981981  1.000000 -0.981981
      C  -1.000000 -0.981981  1.000000
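For reference, here is a self-contained sketch of the same idea using the CSV from the question, read from a string instead of daily.csv. It also builds the AA, AB, ... pair labels the question asks for; the pairs dictionary is my own naming, not something pandas provides. Pivoting aligns the symbols by date, so symbols with fewer rows simply get NaN where they have no data.

```python
import io
import itertools

import pandas as pd

csv_text = """date,sym,close
2014.01.01,A,10
2014.01.02,A,11
2014.01.03,A,12
2014.01.04,A,13
2014.01.01,B,20
2014.01.02,B,22
2014.01.03,B,23
2014.01.01,C,33
2014.01.02,C,32
2014.01.03,C,31
"""

df = pd.read_csv(io.StringIO(csv_text))

# One column per symbol, one row per date; missing dates become NaN.
wide = df.pivot(index="date", columns="sym", values="close")

# Pairwise Pearson correlation; NaN rows are skipped per pair.
pair_corr = wide.corr()

# Flatten the matrix into labelled pairs such as "AB" or "CA".
pairs = {
    a + b: pair_corr.loc[a, b]
    for a, b in itertools.product(pair_corr.index, repeat=2)
}
print(pairs)
```

If the per-symbol series should be transformed first (as func does in the question), the transform can be applied to wide column by column before calling corr().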
But if you want to prettify it, check out seaborn:
import seaborn as sns
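# sns.corrplot has been removed from seaborn; a heatmap of the correlation matrix is the current equivalent
sns.heatmap(dfp.corr(), annot=True)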
result: