Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

get unique combination values of a correlation matrix - pandas

Let's suppose I have a correlation matrix that looks like this:

df = pd.DataFrame(data={'a':[1,0.2,0.3,0.4],'b':[0.2,1,0.5,0.6],'c':[0.3,0.5,1,0.7],'d':[0.4,0.6,0.7,1]}, index=['a','b','c','d'])

what is the best way to extract the unique values of each pairwise combination (a-b, a-c, etc)?

df2 =
      a_b  a_c  a_d  b_c  b_d  c_d 
      0.2  0.3  0.4  0.5  0.6  0.7

the only way I see doing this is to write my own function, but was wondering if someone knows a shortcut for this

like image 699
HappyPy Avatar asked Dec 08 '22 14:12

HappyPy


2 Answers

IIUC:

df_out = df.stack()
df_out.index = df_out.index.map('_'.join)
df_out = df_out.to_frame().T

Output:

   a_a  a_b  a_c  a_d  b_a  b_b  b_c  b_d  c_a  c_b  c_c  c_d  d_a  d_b  d_c  
0  1.0  0.2  0.3  0.4  0.2  1.0  0.5  0.6  0.3  0.5  1.0  0.7  0.4  0.6  0.7   

And, if you want to get rid of a_a, b_b, etc..

df_out = df.stack()
df_out = df_out[df_out.index.get_level_values(0) != df_out.index.get_level_values(1)]
df_out.index = df_out.index.map('_'.join)
df_out = df_out.to_frame().T

Output

   a_b  a_c  a_d  b_a  b_c  b_d  c_a  c_b  c_d  d_a  d_b  d_c
0  0.2  0.3  0.4  0.2  0.5  0.6  0.3  0.5  0.7  0.4  0.6  0.7

Or to get rid of b_a and keep a_b:

df_out = df.stack()
df_out = df_out[df_out.index.get_level_values(0) < df_out.index.get_level_values(1)]
df_out.index = df_out.index.map('_'.join)
df_out = df_out.to_frame().T

Or combining a few lines using lambda function in .loc:

df_out = df.stack().loc[lambda x: x.index.get_level_values(0) < x.index.get_level_values(1)]
df_out.index = df_out.index.map('_'.join)
df_out = df_out.to_frame().T

Output:

   a_b  a_c  a_d  b_c  b_d  c_d
0  0.2  0.3  0.4  0.5  0.6  0.7
like image 133
Scott Boston Avatar answered Dec 10 '22 03:12

Scott Boston


IIUC, you can play with indexes

df2 = df.unstack().reset_index()
s = df2[['level_0', 'level_1']].agg(frozenset,1).drop_duplicates()
df2 = df2.loc[s.index]
ind = df2.agg(lambda k:  (k['level_0']+'_'+k['level_1']), axis=1)
df2.set_index(ind)[0].to_frame().T

    a_a a_b a_c a_d b_b b_c b_d c_c c_d d_d
0   1.0 0.2 0.3 0.4 1.0 0.5 0.6 1.0 0.7 1.0
like image 35
rafaelc Avatar answered Dec 10 '22 02:12

rafaelc