I have a data set with 6 columns, from which I let pandas calculate the correlation matrix, with the following result:
               age  earnings    height     hours  siblings    weight
age       1.000000  0.026032  0.040002  0.024118  0.155894  0.048655
earnings  0.026032  1.000000  0.276373  0.224283  0.126651  0.092299
height    0.040002  0.276373  1.000000  0.235616  0.077551  0.572538
hours     0.024118  0.224283  0.235616  1.000000  0.067797  0.143160
siblings  0.155894  0.126651  0.077551  0.067797  1.000000  0.018367
weight    0.048655  0.092299  0.572538  0.143160  0.018367  1.000000
How can I get the combinations of columns where the correlation is higher than some threshold, for example 0.5, but the columns are not equal? So in this case, the output needs to be something like:
[('height', 'weight')]
I tried to do it with for loops, but I think that's not the right/most efficient way:
correlated = []
for column1 in columns:
    for column2 in columns:
        if column1 != column2:
            correlation = df[column1].corr(df[column2])
            if correlation > 0.5 and (column2, column1) not in correlated:
                correlated.append((column1, column2))
Here df is my original dataframe. This outputs the desired result:
[(u'height', u'weight')]
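How about the following, using numpy? It assumes df already holds your correlation matrix (the result of calling corr() on the original dataframe), not the raw data:
import numpy as np
# Row/column positions of every entry above the threshold
indices = np.where(df > 0.5)
# x < y keeps only the upper triangle: it skips the diagonal and mirrored duplicates
indices = [(df.index[x], df.columns[y]) for x, y in zip(*indices)
           if x < y]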
This will result in indices containing:
[('height', 'weight')]
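For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the same idea. The helper name correlated_pairs, the threshold parameter, and the sample data are made up for illustration and are not from the question; it also swaps the x < y check for an explicit upper-triangle mask built with np.triu, which should give the same pairs.

```python
import numpy as np
import pandas as pd


def correlated_pairs(data, threshold=0.5):
    """Return (column1, column2) pairs whose correlation exceeds threshold.

    Hypothetical helper for illustration; `data` is the raw dataframe.
    """
    corr = data.corr()
    # True strictly above the diagonal, so each pair appears once and a
    # column is never paired with itself.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    rows, cols = np.where((corr.to_numpy() > threshold) & upper)
    return [(corr.index[r], corr.columns[c]) for r, c in zip(rows, cols)]


# Made-up sample data: "b" is roughly 2 * "a", "c" is unrelated noise.
sample = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 5.9, 8.2, 9.9],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
pairs = correlated_pairs(sample)
print(pairs)
```

Note that this only catches positive correlations above the threshold, the same as the comparison in the answer above.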