I have the following correlation matrix which was created using pandas: df.corr()
symbol aaa bbb ccc ddd eee
symbol
aaa 1.000000 0.346099 0.131874 -0.150910 0.177589
bbb 0.346099 1.000000 0.177308 -0.384893 0.301150
ccc 0.131874 0.177308 1.000000 -0.176995 0.258812
ddd -0.150910 -0.384893 -0.176995 1.000000 -0.310137
eee 0.177589 0.301150 0.258812 -0.310137 1.000000
From the above dataframe, I need to transform it into a 3 column dataframe as follows:
aaa aaa 1.000000
aaa bbb 0.346099
aaa ccc 0.131874
aaa ddd -0.150910
aaa eee 0.177589
bbb aaa 0.346099
bbb bbb 1.000000
bbb ccc 0.177308
bbb ddd -0.384893
bbb eee 0.301150
ccc aaa 0.131874
ccc bbb 0.177308
ccc ccc 1.000000
ccc ddd -0.176995
ccc eee 0.258812
ddd aaa -0.150910
ddd bbb -0.384893
ddd ccc -0.176995
ddd ddd 1.000000
ddd eee -0.310137
eee aaa 0.177589
eee bbb 0.301150
eee ccc 0.258812
eee ddd -0.310137
eee eee 1.000000
As shown, it is the same data, just presented differently. Each column/row pair from the original dataframe is simply grouped together into its own row in the new dataframe.
Unfortunately, I can't figure out how to get this done with the result being a dataframe. I have tried df.stack(), but the result is a Series. I need a dataframe so that I can work with the columns. The other problem with df.stack() is that it does not fill in every row; here is a small sample of the issue:
aaa aaa 1.000000
bbb 0.346099
ccc 0.131874
ddd -0.150910
eee 0.177589
bbb aaa 0.346099
bbb 1.000000
ccc 0.177308
ddd -0.384893
eee 0.301150
etc...
You can call a.reshape((2, 2)) directly on a NumPy array, but a pandas DataFrame has no reshape method. To reshape one, convert the DataFrame to a NumPy ndarray first, then reshape the array.
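As a minimal sketch of that conversion (the frame `df_small` and its column names are hypothetical, chosen only for illustration), the round trip through NumPy might look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 frame used only for illustration.
df_small = pd.DataFrame({"x": [1, 2], "y": [3, 4], "z": [5, 6]})

# Convert to a NumPy array first, then reshape the array.
arr = df_small.to_numpy()        # shape (2, 3)
reshaped = arr.reshape((3, 2))   # shape (3, 2), filled in row-major order

# Wrap the result in a new DataFrame if labels are needed again.
df_reshaped = pd.DataFrame(reshaped, columns=["p", "q"])
print(df_reshaped)
```

Note that this reshapes the raw values only; the original row and column labels are lost, which is why stack() or melt() is usually a better fit for a labelled correlation matrix.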
You can use the following basic syntax to convert a pandas DataFrame from a wide format to a long format: df = pd.melt(df, id_vars='col1', value_vars=['col2', 'col3', ...])
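Applied to a correlation matrix, the melt approach could be sketched as follows. The values are copied from the question (first three symbols only); the column names `symbol_1`, `symbol_2` and `correlation` are assumptions made for this example.

```python
import pandas as pd

# Correlation values copied from the question (first three symbols only).
corr = pd.DataFrame(
    [[1.000000, 0.346099, 0.131874],
     [0.346099, 1.000000, 0.177308],
     [0.131874, 0.177308, 1.000000]],
    index=["aaa", "bbb", "ccc"],
    columns=["aaa", "bbb", "ccc"],
)

# Move the row labels into a regular column so melt can use it as id_vars.
wide = corr.rename_axis("symbol_1").reset_index()

# Every remaining column becomes a (symbol_2, correlation) pair.
long_df = pd.melt(
    wide,
    id_vars="symbol_1",
    var_name="symbol_2",
    value_name="correlation",
)
print(long_df)
```

Unlike stack(), melt walks through the data column by column, so the rows come out grouped by `symbol_2` rather than by `symbol_1`.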
In pandas, data reshaping means transforming the structure of a table or vector (i.e. a DataFrame or Series) to make it suitable for further analysis. Some of pandas' reshaping capabilities do not readily exist in other environments (e.g. SQL or bare-bones R) and can be tricky for a beginner.
Interpreting the value of ρ: 0.9 to 1 (positive or negative) indicates a very strong correlation, 0.7 to 0.9 a strong correlation, 0.5 to 0.7 a moderate correlation, and 0.3 to 0.5 a weak correlation.
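These bands can be turned into a small helper, sketched below. The function name `correlation_strength` is hypothetical. Two things are assumptions not stated in the text: values below 0.3 are labelled "negligible", and a value sitting exactly on a boundary is assigned to the stronger band.

```python
def correlation_strength(rho: float) -> str:
    """Label a correlation coefficient using the bands quoted above."""
    magnitude = abs(rho)  # the sign does not affect the strength
    if magnitude >= 0.9:
        return "very strong"
    if magnitude >= 0.7:
        return "strong"
    if magnitude >= 0.5:
        return "moderate"
    if magnitude >= 0.3:
        return "weak"
    # Assumption: the text does not name the band below 0.3.
    return "negligible"


# Example with values taken from the matrix in the question.
for value in (0.346099, -0.384893, 0.131874):
    print(value, correlation_strength(value))
```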
You need to add reset_index:
#reset columns and index names
df = df.rename_axis(None).rename_axis(None, axis=1)
#if pandas version below 0.18.0
#df.columns.name = None
#df.index.name = None
print (df)
aaa bbb ccc ddd eee
aaa 1.000000 0.346099 0.131874 -0.150910 0.177589
bbb 0.346099 1.000000 0.177308 -0.384893 0.301150
ccc 0.131874 0.177308 1.000000 -0.176995 0.258812
ddd -0.150910 -0.384893 -0.176995 1.000000 -0.310137
eee 0.177589 0.301150 0.258812 -0.310137 1.000000
df1 = df.stack().reset_index()
#set column names
df1.columns = ['a','b','c']
print (df1)
a b c
0 aaa aaa 1.000000
1 aaa bbb 0.346099
2 aaa ccc 0.131874
3 aaa ddd -0.150910
4 aaa eee 0.177589
5 bbb aaa 0.346099
6 bbb bbb 1.000000
7 bbb ccc 0.177308
8 bbb ddd -0.384893
9 bbb eee 0.301150
10 ccc aaa 0.131874
11 ccc bbb 0.177308
12 ccc ccc 1.000000
13 ccc ddd -0.176995
14 ccc eee 0.258812
15 ddd aaa -0.150910
16 ddd bbb -0.384893
17 ddd ccc -0.176995
18 ddd ddd 1.000000
19 ddd eee -0.310137
20 eee aaa 0.177589
21 eee bbb 0.301150
22 eee ccc 0.258812
23 eee ddd -0.310137
24 eee eee 1.000000
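On the second concern in the question (rows that appear not to be filled in): the stacked Series does carry both labels on every row, and the blanks are only how pandas prints a MultiIndex. A small sketch, using a 2x2 slice of the values from the question, shows this and also names the columns in one step instead of assigning df1.columns afterwards:

```python
import pandas as pd

# 2x2 slice of the correlation matrix from the question.
corr = pd.DataFrame(
    [[1.000000, 0.346099],
     [0.346099, 1.000000]],
    index=["aaa", "bbb"],
    columns=["aaa", "bbb"],
)

stacked = corr.stack()  # a Series indexed by a two-level MultiIndex

# Each entry is a full (row label, column label) pair; nothing is missing.
pairs = list(stacked.index)
print(pairs)

# Name the two index levels, then turn them into columns;
# name="c" sets the name of the value column.
df_long = stacked.rename_axis(["a", "b"]).reset_index(name="c")
print(df_long)
```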
Use the code below to (a) reshape the correlation matrix, (b) remove duplicate rows (e.g., {aaa, bbb} and {bbb, aaa}), and (c) remove rows that contain the same variable in the first two columns (e.g., {aaa, aaa}):
# calculate the correlation matrix and reshape
df_corr = df.corr().stack().reset_index()
# rename the columns
df_corr.columns = ['FEATURE_1', 'FEATURE_2', 'CORRELATION']
# create a mask to identify rows with duplicate features as mentioned above
mask_dups = (df_corr[['FEATURE_1', 'FEATURE_2']].apply(frozenset, axis=1).duplicated()) | (df_corr['FEATURE_1']==df_corr['FEATURE_2'])
# apply the mask to clean the correlation dataframe and renumber the rows from 0
df_corr = df_corr[~mask_dups].reset_index(drop=True)
This will generate an output like this:
FEATURE_1 FEATURE_2 CORRELATION
0 aaa bbb 0.346099
1 aaa ccc 0.131874
2 aaa ddd -0.150910
3 aaa eee 0.177589
4 bbb ccc 0.177308
5 bbb ddd -0.384893
6 bbb eee 0.301150
7 ccc ddd -0.176995
8 ccc eee 0.258812
9 ddd eee -0.310137
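An alternative way to keep each pair once is to mask everything except the upper triangle before stacking. This is a sketch under stated assumptions, not the method used above: it swaps in NumPy's triu for the frozenset mask, and it relies on the matrix being symmetric. The values are copied from the question (first three symbols only).

```python
import numpy as np
import pandas as pd

corr = pd.DataFrame(
    [[1.000000, 0.346099, 0.131874],
     [0.346099, 1.000000, 0.177308],
     [0.131874, 0.177308, 1.000000]],
    index=["aaa", "bbb", "ccc"],
    columns=["aaa", "bbb", "ccc"],
)

# Boolean mask that is True strictly above the diagonal (k=1 skips the diagonal).
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

df_pairs = (
    corr.where(upper)   # cells outside the upper triangle become NaN
        .stack()
        .dropna()       # drop masked cells (some pandas versions already do this in stack)
        .rename_axis(["FEATURE_1", "FEATURE_2"])
        .reset_index(name="CORRELATION")
)
print(df_pairs)
```

This avoids the row-wise apply, which may matter for matrices with many features.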