
pandas: Get combination of columns where correlation is high

I have a data set with 6 columns, from which I let pandas calculate the correlation matrix, with the following result:

               age  earnings    height     hours  siblings    weight
age       1.000000  0.026032  0.040002  0.024118  0.155894  0.048655
earnings  0.026032  1.000000  0.276373  0.224283  0.126651  0.092299
height    0.040002  0.276373  1.000000  0.235616  0.077551  0.572538
hours     0.024118  0.224283  0.235616  1.000000  0.067797  0.143160
siblings  0.155894  0.126651  0.077551  0.067797  1.000000  0.018367
weight    0.048655  0.092299  0.572538  0.143160  0.018367  1.000000

How can I get the combinations of columns where the correlation is, for example, higher than 0.5, excluding the pairing of a column with itself? In this case, the output should be something like:

[('height', 'weight')]

I tried to do it with for loops, but I think that's not the right/most efficient way:

correlated = []
for column1 in df.columns:
    for column2 in df.columns:
        if column1 != column2:
            correlation = df[column1].corr(df[column2])
            if correlation > 0.5 and (column2, column1) not in correlated:
                correlated.append((column1, column2))

Here df is my original DataFrame. This outputs the desired result:

[(u'height', u'weight')]
Peter asked Oct 20 '14 10:10



1 Answer

How about the following, using numpy? Note that df here is assumed to hold the correlation matrix itself (for example, the result of calling .corr() on your original DataFrame), not the raw data:

import numpy as np

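# Positions (row, column) of every entry above the threshold.
indices = np.where(df > 0.5)
# x < y keeps only entries above the diagonal, which drops both the
# self-correlations (x == y) and the mirrored duplicates (x > y).
indices = [(df.index[x], df.columns[y]) for x, y in zip(*indices)
                                        if x < y]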

This will result in indices containing:

[('height', 'weight')]
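
For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the same idea. It rebuilds the correlation matrix from the question by hand and wraps the lookup in a helper. The name high_corr_pairs and its threshold parameter are my own for illustration, not part of pandas or numpy, and the upper-triangle mask built with np.triu is just one way to express the x < y filter above.

```python
import numpy as np
import pandas as pd

# Correlation matrix copied from the question.
cols = ["age", "earnings", "height", "hours", "siblings", "weight"]
corr = pd.DataFrame(
    [
        [1.000000, 0.026032, 0.040002, 0.024118, 0.155894, 0.048655],
        [0.026032, 1.000000, 0.276373, 0.224283, 0.126651, 0.092299],
        [0.040002, 0.276373, 1.000000, 0.235616, 0.077551, 0.572538],
        [0.024118, 0.224283, 0.235616, 1.000000, 0.067797, 0.143160],
        [0.155894, 0.126651, 0.077551, 0.067797, 1.000000, 0.018367],
        [0.048655, 0.092299, 0.572538, 0.143160, 0.018367, 1.000000],
    ],
    index=cols,
    columns=cols,
)


def high_corr_pairs(corr, threshold=0.5):
    """Hypothetical helper: label pairs whose correlation exceeds threshold."""
    # k=1 selects the strict upper triangle, so the diagonal and the
    # mirrored lower-triangle entries are never considered.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    rows, columns = np.where(upper & (corr.to_numpy() > threshold))
    return [(corr.index[r], corr.columns[c]) for r, c in zip(rows, columns)]


pairs = high_corr_pairs(corr)
print(pairs)  # [('height', 'weight')]
```

With your own data you would pass df.corr() instead of the hand-built matrix. If strong negative correlations should count as well, comparing np.abs(corr.to_numpy()) against the threshold is one option, though that goes beyond what the question asked for.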
Michael Brennan answered Sep 21 '22 21:09
