
How do I loop over a correlation matrix to get only the pairs of correlations above a certain threshold, and/or make the loop more efficient?

I've got the following code:

for i in corr.columns:
    for j in corr.columns:
        if corr.loc[i, j] > 0.7 and corr.loc[i, j] != 1:
            print(i, ' ', j, ' ', corr.loc[i, j])

The problem is that while this works, it returns both corr[i,j] and corr[j,i] as if they were different correlations. Is there any way to loop through just the 'bottom triangle' of the correlation matrix?

pakkunrob asked Mar 13 '23 13:03



1 Answer

Below is one possibility, still using a loop structure similar to yours. Starting j at i+1 restricts the loop to the upper triangle of the matrix, which skips the diagonal (where the correlation is always 1) and visits each pair only once; because the matrix is symmetric, this is equivalent to looping over the bottom triangle. Additionally, while indexing with string labels as you do can make some programs more readable or robust, indexing a 2D numpy array with integers will probably prove faster, and it is more concise since no label-based accessor is needed. Integer indexing is also what lets you skip the elements you know you don't need.

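# Get some toy data and extract some information from it.
# DataReader lives in the separate pandas-datareader package
# (the 'yahoo' source may not always be available).
from pandas_datareader import data as web
X = web.DataReader('aapl', 'yahoo')
rows, cols = X.shape
flds = list(X.columns)

# Indexing with numbers on a numpy matrix will probably be faster
corr = X.corr().values

# j starts at i+1, so only the upper triangle (no diagonal) is visited
for i in range(cols):
    for j in range(i + 1, cols):
        if corr[i, j] > 0.7:
            print(flds[i], ' ', flds[j], ' ', corr[i, j])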

Running the code above yields something like:

Open   High   0.99983447301
Open   Low   0.999763093885
Open   Close   0.999564997906
High   Low   0.999744241894
High   Close   0.999815965479
Low   Close   0.999794304851
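
If the Python-level loop itself becomes the bottleneck on a large matrix, one option is to let numpy pick out the upper triangle and apply the threshold in one go. The following is only a sketch under some assumptions: the function name pairs_above and the synthetic data are made up for illustration (they are not the AAPL prices used above), and np.triu_indices with k=1 is used to skip the diagonal.

```python
import numpy as np
import pandas as pd


def pairs_above(corr, threshold=0.7):
    """Return (label_i, label_j, value) for each pair in the strict upper
    triangle of the correlation DataFrame `corr` that exceeds `threshold`.

    `pairs_above` is a hypothetical helper name used for illustration.
    """
    values = corr.to_numpy()
    labels = corr.columns
    # Index arrays for the strict upper triangle (k=1 skips the diagonal).
    rows, cols = np.triu_indices(len(labels), k=1)
    # Boolean mask of the upper-triangle entries above the threshold.
    keep = values[rows, cols] > threshold
    return [(labels[i], labels[j], values[i, j])
            for i, j in zip(rows[keep], cols[keep])]


# Synthetic stand-in data (an assumption, used so the example is offline).
rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    'a': base,
    'b': base + 0.1 * rng.normal(size=200),  # strongly correlated with 'a'
    'c': rng.normal(size=200),               # independent noise
})

for left, right, value in pairs_above(df.corr()):
    print(left, ' ', right, ' ', value)
```

Each pair is reported once, and the label lookup only happens for the pairs that pass the threshold, so the cost of string labels is paid only where it is needed.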
Jacob Amos answered Apr 06 '23 03:04
