
Mean of a correlation matrix - pandas DataFrame

Tags: python, pandas

I have a large correlation matrix in a pandas DataFrame, df, with shape (342, 342).

How do I take the mean, standard deviation, etc. of all of the numbers in the upper triangle, not including the 1's along the diagonal?

Thank you.

user1911092 asked Jan 02 '13 22:01




2 Answers

Another potential one-line answer:

In [1]: corr
Out[1]:
          a         b         c         d         e
a  1.000000  0.022246  0.018614  0.022592  0.008520
b  0.022246  1.000000  0.033029  0.049714 -0.008243
c  0.018614  0.033029  1.000000 -0.016244  0.049010
d  0.022592  0.049714 -0.016244  1.000000 -0.015428
e  0.008520 -0.008243  0.049010 -0.015428  1.000000

In [2]: corr.values[np.triu_indices_from(corr.values,1)].mean()
Out[2]: 0.016381
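As a minimal sketch of the same idea extended to other statistics: the toy data, the seed, and the variable names below are assumptions for illustration, not part of the original answer. Passing k=1 starts the selection one step above the main diagonal, which is what leaves out the 1's. Note that NumPy's std() uses ddof=0 by default while pandas uses ddof=1, so ddof is set explicitly here.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real (342, 342) matrix; the seed is arbitrary.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((1000, 5)), columns=list("abcde"))
corr = frame.corr()

# k=1 selects entries strictly above the main diagonal.
values = corr.to_numpy()
rows, cols = np.triu_indices_from(values, k=1)
upper = values[rows, cols]

# An n x n matrix has n * (n - 1) / 2 entries above the diagonal.
n = corr.shape[0]
print(upper.size == n * (n - 1) // 2)

# ddof=1 gives the sample standard deviation, matching the pandas default.
print(upper.mean(), upper.std(ddof=1), np.median(upper))
```

Because `upper` is a plain 1-D array, any NumPy reduction (min, max, percentiles) can be applied to it in the same way.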

Edit: added performance metrics

Performance of my solution:

In [3]: %timeit corr.values[np.triu_indices_from(corr.values,1)].mean()
10000 loops, best of 3: 48.1 us per loop

Performance of Theodros Zelleke's one-line solution:

In [4]: %timeit corr.unstack().ix[zip(*np.triu_indices_from(corr, 1))].mean()
1000 loops, best of 3: 823 us per loop

Performance of DSM's solution:

In [5]: def method1(df):
   ...:     df2 = df.copy()
   ...:     df2.values[np.tril_indices_from(df2)] = np.nan
   ...:     return df2.unstack().mean()
   ...:

In [6]: %timeit method1(corr)
1000 loops, best of 3: 242 us per loop
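For anyone who wants to repeat a comparison like this outside IPython, here is a rough sketch using the standard-library timeit module. The function names triu_mean and masked_mean are hypothetical, the masked variant uses DataFrame.where rather than writing through .values, and the matrix is random toy data of the size mentioned in the question. Timings depend heavily on the machine and library versions, so no figures are quoted.

```python
import timeit

import numpy as np
import pandas as pd

# Random toy data giving a (342, 342) correlation matrix; the seed is arbitrary.
rng = np.random.default_rng(0)
corr = pd.DataFrame(rng.standard_normal((500, 342))).corr()


def triu_mean(df):
    """Mean of the strict upper triangle via NumPy fancy indexing."""
    values = df.to_numpy()
    return values[np.triu_indices_from(values, k=1)].mean()


def masked_mean(df):
    """Same statistic via a boolean mask; mean() skips the NaN cells."""
    keep = np.triu(np.ones(df.shape, dtype=bool), k=1)
    return df.where(keep).unstack().mean()


# Time each approach over a fixed number of calls and report per-call cost.
runs = 50
for name, fn in [("triu index", triu_mean), ("where mask", masked_mean)]:
    seconds = timeit.timeit(lambda: fn(corr), number=runs)
    print(f"{name}: {seconds / runs * 1e6:.1f} us per call")
```

Both functions should return the same value up to floating-point rounding, which is worth checking before trusting any timing comparison.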
Zelazny7 answered Sep 20 '22 23:09


This is kind of fun. I make no guarantees that this is the real pandas-fu; I'm still at the "numpy + better indexing" stage of learning pandas myself. That said, something like this should get the job done.

First, we make a toy correlation matrix to play with:

>>> import pandas as pd
>>> import numpy as np
>>> frame = pd.DataFrame(np.random.randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])
>>> corr = frame.corr()
>>> corr
          a         b         c         d         e
a  1.000000  0.022246  0.018614  0.022592  0.008520
b  0.022246  1.000000  0.033029  0.049714 -0.008243
c  0.018614  0.033029  1.000000 -0.016244  0.049010
d  0.022592  0.049714 -0.016244  1.000000 -0.015428
e  0.008520 -0.008243  0.049010 -0.015428  1.000000

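Then we make a copy and use tril_indices_from to get the indices of the lower triangle (diagonal included), which we mask with NaN. Note that the output below appears to come from a different random draw than the matrix printed above, so the values do not match: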

>>> c2 = corr.copy()
>>> c2.values[np.tril_indices_from(c2)] = np.nan
>>> c2
    a        b         c         d         e
a NaN  0.06952 -0.021632 -0.028412 -0.029729
b NaN      NaN -0.022343 -0.063658  0.055247
c NaN      NaN       NaN -0.013272  0.029102
d NaN      NaN       NaN       NaN -0.046877
e NaN      NaN       NaN       NaN       NaN

Now we can compute statistics on the flattened array; mean() and std() skip the NaN entries by default:

>>> c2.unstack().mean()
-0.0072054178481488901
>>> c2.unstack().std()
0.043839624201635466
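Assigning through .values relies on it returning a writable view of the frame's data, which pandas does not guarantee in general. A possible alternative, sketched here under that assumption, builds a boolean mask and uses DataFrame.where, which returns a new frame and leaves the original untouched. The names keep, upper, and pairs and the toy data are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy correlation matrix, built the same way as above but with a fixed seed.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((1000, 5)), columns=list("abcde"))
corr = frame.corr()

# True strictly above the diagonal, False on and below it.
keep = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# where() keeps entries where the mask is True and puts NaN elsewhere.
upper = corr.where(keep)

# Flatten to a Series labelled by (column, row) and drop the masked cells.
pairs = upper.unstack().dropna()

# pandas std() uses ddof=1 (sample standard deviation) by default.
print(pairs.mean(), pairs.std())
print(pairs.describe())
```

Keeping the result as a labelled Series makes it easy to see which pair of columns produced a given correlation, for example with pairs.idxmax().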
DSM answered Sep 18 '22 23:09