
Show correlations as an ordered list, not as a large matrix

Tags:

r

I've a data frame with 100+ columns. cor() returns remarkably quickly, but tells me far too much, especially as most columns are not correlated. I'd like it to just tell me column pairs and their correlation, ideally ordered.

In case that doesn't make sense, here is an artificial example:

df = data.frame(a=1:10,b=20:11*20:11,c=runif(10),d=runif(10),e=runif(10)*1:10)
z = cor(df)

z looks like this:

           a          b           c           d          e
a  1.0000000 -0.9966867 -0.38925240 -0.35142452  0.2594220
b -0.9966867  1.0000000  0.40266637  0.35896626 -0.2859906
c -0.3892524  0.4026664  1.00000000  0.03958307  0.1781210
d -0.3514245  0.3589663  0.03958307  1.00000000 -0.3901608
e  0.2594220 -0.2859906  0.17812098 -0.39016080  1.0000000

What I'm looking for is a function that will instead tell me:

a:b -0.9966867
b:c  0.4026664
d:e -0.39016080
a:c -0.3892524
b:d  0.3589663
a:d -0.3514245
b:e -0.2859906
a:e  0.2594220
c:e  0.17812098
c:d  0.03958307

I have a crude way to get rid of some of the noise:

z[abs(z)<0.5]=0 

then scan looking for non-zero values. But it is far inferior to the desired output above.

UPDATE: Based on the answers received, and some trial and error, here is the solution I went with:

z[lower.tri(z,diag=TRUE)]=NA  #Prepare to drop duplicates and meaningless information
z=as.data.frame(as.table(z))  #Turn into a 3-column table
z=na.omit(z)  #Get rid of the junk we flagged above
z=z[order(-abs(z$Freq)),]    #Sort by highest correlation (whether +ve or -ve)
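
For reuse, the four steps above could be wrapped in a small function that also builds the a:b style labels from the desired output. This is only a sketch: the name ordered_cor, the threshold argument, and the set.seed() call are assumptions added for illustration and are not part of the original post.

```r
# Hypothetical helper: returns column pairs ordered by absolute correlation.
ordered_cor <- function(df, threshold = 0) {
  z <- cor(df)
  z[lower.tri(z, diag = TRUE)] <- NA        # flag mirrored duplicates and the diagonal
  z <- na.omit(as.data.frame(as.table(z)))  # long format: Var1, Var2, Freq
  z <- z[abs(z$Freq) >= threshold, ]        # optional cut-off (0 keeps everything)
  z <- z[order(-abs(z$Freq)), ]             # strongest first, whether +ve or -ve
  data.frame(pair = paste(z$Var1, z$Var2, sep = ":"),
             cor = z$Freq,
             stringsAsFactors = FALSE)
}

set.seed(1)  # only so the random columns are reproducible
df <- data.frame(a = 1:10, b = 20:11 * 20:11, c = runif(10),
                 d = runif(10), e = runif(10) * 1:10)
res <- ordered_cor(df)
head(res, 3)
```

With 5 columns this gives 10 rows, one per unordered pair. Passing threshold = 0.5 would keep only the stronger pairs, which may be more practical for a data frame with 100+ columns.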
Darren Cook asked Aug 16 '11 06:08




2 Answers

I always use

zdf <- as.data.frame(as.table(z))
zdf
#    Var1 Var2     Freq
# 1     a    a  1.00000
# 2     b    a -0.99669
# 3     c    a -0.14063
# 4     d    a -0.28061
# 5     e    a  0.80519

Then use subset(zdf, abs(Freq) > 0.5) to keep only the strong correlations (here, those with an absolute value above 0.5).
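
Note that the long table lists every pair twice (a/b and b/a) plus each column against itself. One possible way to filter those out in the same subset() call is to compare the factor codes of Var1 and Var2. The snippet below is a sketch under that assumption; the example data and the name strong are made up for illustration.

```r
set.seed(42)  # only so the random column is reproducible
df <- data.frame(a = 1:10, b = 20:11 * 20:11, c = runif(10))
zdf <- as.data.frame(as.table(cor(df)))

# Var1 and Var2 are factors whose levels follow the column order, so
# comparing their integer codes keeps each unordered pair exactly once
# and drops the self-pairs (a/a, b/b, ...).
strong <- subset(zdf, abs(Freq) > 0.5 & as.integer(Var1) < as.integer(Var2))
strong
```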

Marek answered Oct 03 '22 11:10


library(reshape)

z[z == 1] <- NA #drop perfect
z[abs(z) < 0.5] <- NA # drop less than abs(0.5)
z <- na.omit(melt(z)) # melt!

z[order(-abs(z$value)),] # sort
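
Two caveats with this approach: z == 1 also discards any off-diagonal pair whose correlation happens to be exactly 1, and both mirrored halves of the matrix survive, so each remaining pair shows up twice. If the reshape package is not available, a similar result can be sketched in base R with which(..., arr.ind = TRUE). The example data and the names idx and out are assumptions for illustration, not part of the answer.

```r
set.seed(7)  # only so the random columns are reproducible
df <- data.frame(a = 1:10, b = 20:11 * 20:11, c = runif(10), d = runif(10))
z <- cor(df)

# Row/column positions of upper-triangle cells with |r| >= 0.5;
# upper.tri() excludes the diagonal and the mirrored lower half.
idx <- which(upper.tri(z) & abs(z) >= 0.5, arr.ind = TRUE)

out <- data.frame(row = rownames(z)[idx[, "row"]],
                  col = colnames(z)[idx[, "col"]],
                  value = z[idx])          # matrix indexing by (row, col) pairs
out <- out[order(-abs(out$value)), ]       # sort, strongest first
out
```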
Brandon Bertelsen answered Oct 03 '22 12:10