I have a bunch of stock data, and I am trying to build a dataframe that takes the top two, and bottom stocks from a correlation matrix, and also their actual correlation.
Let's say the matrix, corr
looks like this:
A B C D E
A 1.00 0.65 0.31 0.94 0.55
B 0.87 1.00 0.96 0.67 0.41
C 0.95 0.88 1.00 0.72 0.69
D 0.64 0.84 0.99 1.00 0.78
E 0.71 0.62 0.89 0.32 1.00
What I want to do is to be able to is return the best two, and least correlated stocks, and their correlation, for Stock A, B, C, D & E, while dropping the obvious 1.00 correlation each stock has to itself.
The resulting dataframe, or whatever is easiest to display this is would look like this:
Stock 1st 1st_Val 2nd 2nd_Val Last Last_Val
A D 0.94 B 0.65 C 0.31
B C 0.96 A 0.87 E 0.41
C A 0.95 B 0.88 E 0.69
D C 0.99 B 0.84 A 0.64
E C 0.89 A 0.71 D 0.32
With my attempts so far I have been able to look through and return the relevant stock names using corr[stock].nlargest().index[0:].tolist()
, and then taking [1]
, [2]
and [-1]
from each list and sticking them in a dictionary and building the dataframe from there. But I'm unable to return the correlation value and I suspect I'm not doing this in the most efficient way anyway.
Any help really appreciated, cheers
Your conditions are hard to generalize into one command, but here is one approach you can take.
import numpy as np
np.fill_diagonal(corr.values, np.nan)
print(corr)
# A B C D E
#A NaN 0.65 0.31 0.94 0.55
#B 0.87 NaN 0.96 0.67 0.41
#C 0.95 0.88 NaN 0.72 0.69
#D 0.64 0.84 0.99 NaN 0.78
#E 0.71 0.62 0.89 0.32 NaN
You can use the answer on Find names of top-n highest-value columns in each pandas dataframe row to get the top 2 and bottom one value for each row (Stock).
order_top2 = np.argsort(-corr.values, axis=1)[:, :2]
order_bottom = np.argsort(corr.values, axis=1)[:, :1]
result_top2 = pd.DataFrame(
corr.columns[order_top2],
columns=['1st', '2nd'],
index=corr.index
)
result_bottom = pd.DataFrame(
corr.columns[order_bottom],
columns=['Last'],
index=corr.index
)
result = result_top2.join(result_bottom)
# 1st 2nd Last
#A D B C
#B C A E
#C A B E
#D C B A
#E C A D
Now use pandas.DataFrame.lookup
to grab the corresponding column value in corr
for each column in result
for x in result.columns:
result[x+"_Val"] = corr.lookup(corr.index, result[x])
print(result)
# 1st 2nd Last 1st_Val 2nd_Val Last_Val
#A D B C 0.94 0.65 0.31
#B C A E 0.96 0.87 0.41
#C A B E 0.95 0.88 0.69
#D C B A 0.99 0.84 0.64
#E C A D 0.89 0.71 0.32
print(result[['1st', '1st_Val', '2nd', '2nd_Val', 'Last', 'Last_Val']])
# 1st 1st_Val 2nd 2nd_Val Last Last_Val
#A D 0.94 B 0.65 C 0.31
#B C 0.96 A 0.87 E 0.41
#C A 0.95 B 0.88 E 0.69
#D C 0.99 B 0.84 A 0.64
#E C 0.89 A 0.71 D 0.32
If you need to visualize the results but you don't actually need to fetch and work with the actual correlation values, then why not using a very simple heatmap? You could also play with the plot to have the numbers displayed on each square.
import seaborn as sns
import pandas as pd
dict = {'Date':['2018-01-01','2018-01-02','2018-01-03','2018-01-04','2018-01-05'],'Col1':[1,2,3,4,5],'Col2':[1.1,1.2,1.3,1.4,1.5],'Col3':[0.33,0.98,1.54,0.01,0.99],'Col4':[8,9.98,6,0.01,0.1],'Col1':[19,42,3,0.4,51]}
df = pd.DataFrame(dict, columns=dict.keys())
sns.heatmap(df.corr())
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With