I've got the following code:
for i in list(corr.columns):
    for j in list(corr.columns):
        if corr.ix[i,j]>0.7 and corr.ix[i,j] != 1:
            print i, ' ',j ,' ', corr.ix[i,j]
The problem is that, whilst this works, it returns both corr[i,j] and corr[j,i] as if they were different correlations. Is there any way I could loop through just the 'bottom triangle' of the correlation matrix?
Below is one possibility, still using a loop structure similar to yours. Notice that by starting j at i+1, you visit each pair of columns only once and skip the diagonal, which removes the duplicate output as well as more than half of the loop iterations. Additionally, while indexing with strings as you do might arguably make some programs more readable/robust, indexing a numpy 2d array with integers will probably prove faster (and more concise, since there is no .ix component). Indexing this way is also what allows you to skip testing elements you know you don't need.
# Get some toy data and extract some information from it
# (pandas.io.data was later split out into the separate pandas-datareader package)
import pandas_datareader.data as web
X = web.DataReader('aapl', 'yahoo')
rows, cols = X.shape
flds = list(X.columns)
# Indexing with numbers on a numpy matrix will probably be faster
corr = X.corr().values
for i in range(cols):
    # Start j at i+1 so each pair is visited once and the diagonal is skipped
    for j in range(i+1, cols):
        if corr[i,j] > 0.7:
            print(flds[i], ' ', flds[j], ' ', corr[i,j])
Running the code above yields something like:
Open High 0.99983447301
Open Low 0.999763093885
Open Close 0.999564997906
High Low 0.999744241894
High Close 0.999815965479
Low Close 0.999794304851
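If you would rather avoid the explicit double loop, the same "one triangle only" idea can be sketched with NumPy's triu_indices, which returns the row and column positions of the entries above the diagonal. This is only a minimal sketch under some assumptions: the helper name high_corr_pairs and the synthetic toy DataFrame are made up for illustration (they are not part of the question or of any library), and the random data stands in for the downloaded prices so the snippet runs without a network connection.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.7):
    """Hypothetical helper: list (col_a, col_b, corr) for each unordered pair above threshold."""
    corr_df = df.corr()
    labels = list(corr_df.columns)
    corr = corr_df.values
    # k=1 starts one step above the diagonal, so each pair appears once
    # and the self-correlations (always 1.0) are never tested.
    rows, cols = np.triu_indices(len(labels), k=1)
    keep = corr[rows, cols] > threshold
    return [(labels[i], labels[j], corr[i, j])
            for i, j in zip(rows[keep], cols[keep])]


# Synthetic stand-in data (an assumption, not the AAPL prices used above):
# 'b' is 'a' plus a little noise, 'c' is independent of both.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    'a': base,
    'b': base + 0.1 * rng.normal(size=200),
    'c': rng.normal(size=200),
})

for name_i, name_j, value in high_corr_pairs(toy):
    print(name_i, ' ', name_j, ' ', value)
```

Because a correlation matrix is symmetric, the upper triangle holds the same pairs as the 'bottom triangle' you asked about; swapping in np.tril_indices(len(labels), k=-1) should give the lower one if you prefer that ordering.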