How do I loop across a correlation matrix to only give me pairs of correlations above a certain threshold? And/or make it more efficient

Question

I've got the following code:

for i in list(corr.columns):
    for j in list(corr.columns):
        # .loc replaces the .ix indexer, which newer pandas versions no longer provide
        if corr.loc[i, j] > 0.7 and corr.loc[i, j] != 1:
            print(i, ' ', j, ' ', corr.loc[i, j])

The problem is that whilst this works, it returns both corr[i,j] and corr[j,i] as if they were different correlations. Is there any way I could loop through just the 'bottom triangle' of the correlation matrix?

Jacob Amos · Accepted Answer

Below is one possibility, still using a loop structure similar to yours. By starting j at i+1, the inner loop visits only the elements above the diagonal, so each pair is tested once and the diagonal (each column's correlation with itself, always 1) is skipped without an explicit check. Because a correlation matrix is symmetric, this upper triangle holds the same pairs as the bottom triangle you asked about. Additionally, while indexing with strings as you do might arguably make some programs more readable/robust, indexing a numpy 2d array with integers will probably prove faster (and more concise, since no .ix/.loc accessor is needed). Integer indexing is also what lets you skip the elements you know you don't need.

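# Get some toy data and extract some information from it
# (pandas.io.data now lives in the separate pandas-datareader package;
# the 'yahoo' source may no longer be reachable)
import pandas_datareader.data as web
X = web.DataReader('aapl', 'yahoo')
rows, cols = X.shape
flds = list(X.columns)

# Indexing with numbers on a numpy matrix will probably be faster
corr = X.corr().values

# j starts at i+1, so only the entries above the diagonal are visited
for i in range(cols):
    for j in range(i+1, cols):
        if corr[i,j] > 0.7:
            print(flds[i], ' ', flds[j], ' ', corr[i,j])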

Running the code above yields something like:

Open   High   0.99983447301
Open   Low   0.999763093885
Open   Close   0.999564997906
High   Low   0.999744241894
High   Close   0.999815965479
Low   Close   0.999794304851
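
On the "make it more efficient" part of the question: the double loop can also be replaced by selecting the upper triangle in one step. The following is only a sketch of that alternative, not part of the answer above. It assumes a reasonably recent pandas and NumPy, and it uses synthetic data so it runs without a download; the helper name high_corr_pairs and the column names a, b, c are made up for illustration.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(frame, threshold=0.7):
    """Return (col_a, col_b, corr) for each unordered column pair above threshold."""
    corr = frame.corr()
    values = corr.to_numpy()
    labels = corr.columns
    # k=1 starts one step above the diagonal, so self-correlations are skipped
    rows, cols = np.triu_indices(len(labels), k=1)
    # Boolean mask over the upper-triangle entries only
    keep = values[rows, cols] > threshold
    return [
        (labels[i], labels[j], float(values[i, j]))
        for i, j in zip(rows[keep], cols[keep])
    ]


# Synthetic stand-in for the price data (hypothetical columns):
# "b" is "a" plus a little noise, "c" is independent of both
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

pairs = high_corr_pairs(toy)
for left, right, value in pairs:
    print(left, right, value)
```

With this toy frame only the a/b pair should pass the 0.7 threshold. On a real frame the call would be high_corr_pairs(X), and each pair is reported once, as in the loop version.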

Tags: performance, python, loops, correlation

pakkunrob
