Reshaping a pandas correlation matrix

I have the following correlation matrix, which was created in pandas with df.corr():

symbol       aaa       bbb       ccc       ddd       eee
symbol                                                  
aaa     1.000000  0.346099  0.131874 -0.150910  0.177589
bbb     0.346099  1.000000  0.177308 -0.384893  0.301150
ccc     0.131874  0.177308  1.000000 -0.176995  0.258812
ddd    -0.150910 -0.384893 -0.176995  1.000000 -0.310137
eee     0.177589  0.301150  0.258812 -0.310137  1.000000

From the above dataframe, I need to transform it into a 3 column dataframe as follows:

aaa     aaa       1.000000
aaa     bbb       0.346099
aaa     ccc       0.131874
aaa     ddd      -0.150910
aaa     eee       0.177589
bbb     aaa       0.346099
bbb     bbb       1.000000
bbb     ccc       0.177308
bbb     ddd      -0.384893
bbb     eee       0.301150
ccc     aaa       0.131874
ccc     bbb       0.177308
ccc     ccc       1.000000
ccc     ddd      -0.176995
ccc     eee       0.258812
ddd     aaa      -0.150910
ddd     bbb      -0.384893
ddd     ccc      -0.176995
ddd     ddd       1.000000
ddd     eee      -0.310137
eee     aaa       0.177589
eee     bbb       0.301150
eee     ccc       0.258812
eee     ddd      -0.310137
eee     eee       1.000000

As shown, it is the same data, just presented differently. Each row/column pair from the original dataframe is simply given its own row in the new dataframe.

Unfortunately, I can't figure out how to do this so that the result is a dataframe. I have tried df.stack(), but the result is a Series. I need a dataframe so that I can work with the columns. The other problem with df.stack() is that it does not fill in every row. Here is a small sample of the issue:

aaa     aaa       1.000000
        bbb       0.346099
        ccc       0.131874
        ddd      -0.150910
        eee       0.177589
bbb     aaa       0.346099
        bbb       1.000000
        ccc       0.177308
        ddd      -0.384893
        eee       0.301150
etc...
darkpool asked Jun 27 '16 15:06



2 Answers

You need to add reset_index:

#reset columns and index names 
df = df.rename_axis(None).rename_axis(None, axis=1)

#if pandas version below 0.18.0
#df.columns.name = None
#df.index.name = None

print (df)
          aaa       bbb       ccc       ddd       eee
aaa  1.000000  0.346099  0.131874 -0.150910  0.177589
bbb  0.346099  1.000000  0.177308 -0.384893  0.301150
ccc  0.131874  0.177308  1.000000 -0.176995  0.258812
ddd -0.150910 -0.384893 -0.176995  1.000000 -0.310137
eee  0.177589  0.301150  0.258812 -0.310137  1.000000
df1 = df.stack().reset_index()
#set column names
df1.columns = ['a','b','c']
print (df1)
      a    b         c
0   aaa  aaa  1.000000
1   aaa  bbb  0.346099
2   aaa  ccc  0.131874
3   aaa  ddd -0.150910
4   aaa  eee  0.177589
5   bbb  aaa  0.346099
6   bbb  bbb  1.000000
7   bbb  ccc  0.177308
8   bbb  ddd -0.384893
9   bbb  eee  0.301150
10  ccc  aaa  0.131874
11  ccc  bbb  0.177308
12  ccc  ccc  1.000000
13  ccc  ddd -0.176995
14  ccc  eee  0.258812
15  ddd  aaa -0.150910
16  ddd  bbb -0.384893
17  ddd  ccc -0.176995
18  ddd  ddd  1.000000
19  ddd  eee -0.310137
20  eee  aaa  0.177589
21  eee  bbb  0.301150
22  eee  ccc  0.258812
23  eee  ddd -0.310137
24  eee  eee  1.000000
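As a variation on the approach above, the axis names can be set before stacking so that reset_index produces the final column names in one chain. This is only a sketch: the small frame rebuilds part of the matrix from the question, and the names `corr` and `long_df` and the column labels 'a', 'b', 'c' are illustrative choices. It also shows that the blank cells in the stacked output are just how a MultiIndex is printed; no values are missing.

```python
import pandas as pd

# Rebuild a 3x3 slice of the correlation matrix from the question.
labels = ['aaa', 'bbb', 'ccc']
corr = pd.DataFrame(
    [[1.000000, 0.346099, 0.131874],
     [0.346099, 1.000000, 0.177308],
     [0.131874, 0.177308, 1.000000]],
    index=labels, columns=labels,
)
# Both axes share the name 'symbol', as in the question's output.
corr.index.name = 'symbol'
corr.columns.name = 'symbol'

# stack() returns a Series with a two-level MultiIndex. Repeated outer
# labels are left blank when printed, but every row still has both labels.
stacked = corr.stack()

# Give the two index levels distinct names, name the values,
# then turn the index levels into ordinary columns.
long_df = (
    corr.rename_axis(index='a', columns='b')
        .stack()
        .rename('c')
        .reset_index()
)
print(long_df)
```

Giving the two axes distinct names matters here because reset_index cannot create two columns that are both called 'symbol'.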
jezrael answered Sep 21 '22 12:09


Use the code below to (a) reshape the correlation matrix, (b) remove duplicate rows (e.g., {aaa, bbb} and {bbb, aaa}), and (c) remove rows that contain the same variable in the first two columns (e.g., {aaa, aaa}):

# calculate the correlation matrix and reshape
df_corr = df.corr().stack().reset_index()

# rename the columns
df_corr.columns = ['FEATURE_1', 'FEATURE_2', 'CORRELATION']

# create a mask to identify rows with duplicate features as mentioned above
mask_dups = (df_corr[['FEATURE_1', 'FEATURE_2']].apply(frozenset, axis=1).duplicated()) | (df_corr['FEATURE_1']==df_corr['FEATURE_2']) 

# apply the mask to clean the correlation dataframe and renumber the rows
df_corr = df_corr[~mask_dups].reset_index(drop=True)

This will generate an output like this:

    FEATURE_1  FEATURE_2  CORRELATION
0         aaa        bbb     0.346099
1         aaa        ccc     0.131874
2         aaa        ddd    -0.150910
3         aaa        eee     0.177589
4         bbb        ccc     0.177308
5         bbb        ddd    -0.384893
6         bbb        eee     0.301150
7         ccc        ddd    -0.176995
8         ccc        eee     0.258812
9         ddd        eee    -0.310137
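A different way to get the same unique pairs, offered as a sketch rather than as part of the answer above, is to read only the cells above the diagonal with NumPy's triu_indices. This relies on the matrix being symmetric, which holds for df.corr(). The frame below rebuilds the matrix from the question, and the names `corr` and `pairs` are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the correlation matrix from the question.
labels = ['aaa', 'bbb', 'ccc', 'ddd', 'eee']
corr = pd.DataFrame(
    [[ 1.000000,  0.346099,  0.131874, -0.150910,  0.177589],
     [ 0.346099,  1.000000,  0.177308, -0.384893,  0.301150],
     [ 0.131874,  0.177308,  1.000000, -0.176995,  0.258812],
     [-0.150910, -0.384893, -0.176995,  1.000000, -0.310137],
     [ 0.177589,  0.301150,  0.258812, -0.310137,  1.000000]],
    index=labels, columns=labels,
)

# Positions strictly above the diagonal (k=1 skips the diagonal itself),
# so each unordered pair appears once and self-pairs are excluded.
rows, cols = np.triu_indices(len(corr), k=1)

pairs = pd.DataFrame({
    'FEATURE_1': corr.index.to_numpy()[rows],
    'FEATURE_2': corr.columns.to_numpy()[cols],
    'CORRELATION': corr.to_numpy()[rows, cols],
})
print(pairs)
```

For n variables this yields n * (n - 1) / 2 rows (10 for the five symbols here) without building the full 25-row frame first or applying a row-wise function.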
Vishal answered Sep 25 '22 12:09