I've a data frame with 100+ columns. cor() returns remarkably quickly, but tells me far too much, especially as most columns are not correlated. I'd like it to just tell me column pairs and their correlation, ideally ordered.
In case that doesn't make sense, here is an artificial example:
df = data.frame(a=1:10, b=20:11*20:11, c=runif(10), d=runif(10), e=runif(10)*1:10)
z = cor(df)
z looks like this:
           a          b           c           d          e
a  1.0000000 -0.9966867 -0.38925240 -0.35142452  0.2594220
b -0.9966867  1.0000000  0.40266637  0.35896626 -0.2859906
c -0.3892524  0.4026664  1.00000000  0.03958307  0.1781210
d -0.3514245  0.3589663  0.03958307  1.00000000 -0.3901608
e  0.2594220 -0.2859906  0.17812098 -0.39016080  1.0000000
What I'm looking for is a function that will instead tell me:
a:b -0.9966867
b:c  0.4026664
d:e -0.39016080
a:c -0.3892524
b:d  0.3589663
a:d -0.3514245
b:e -0.2859906
a:e  0.2594220
c:e  0.17812098
c:d  0.03958307
I have a crude way to get rid of some of the noise:
z[abs(z)<0.5]=0
then scan looking for non-zero values. But it is far inferior to the desired output above.
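For what it's worth, that scan can also be done programmatically. Here is a minimal sketch, assuming z has already been thresholded as above (the 0.5 cutoff is the same arbitrary choice); the hits name is just illustrative:

# List the cells that survived the thresholding, taking each pair only once.
hits <- which(z != 0 & upper.tri(z), arr.ind = TRUE)   # row/column indices of non-zero cells
data.frame(pair = paste(rownames(z)[hits[, 1]], colnames(z)[hits[, 2]], sep = ":"),
           cor  = z[hits])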
UPDATE: Based on the answers received, and some trial and error, here is the solution I went with:
z[lower.tri(z,diag=TRUE)]=NA  # Prepare to drop duplicates and meaningless information
z=as.data.frame(as.table(z))  # Turn into a 3-column table
z=na.omit(z)                  # Get rid of the junk we flagged above
z=z[order(-abs(z$Freq)),]     # Sort by highest correlation (whether +ve or -ve)
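Wrapped up as a small helper, the same steps become reusable (a sketch only; the name flattenCorr is illustrative, not part of any package):

# Flatten a correlation matrix into a sorted pair/correlation table.
flattenCorr <- function(m) {
  m[lower.tri(m, diag = TRUE)] <- NA        # keep each pair once, drop the diagonal
  out <- na.omit(as.data.frame(as.table(m)))
  out[order(-abs(out$Freq)), ]              # strongest correlations first
}

flattenCorr(cor(df))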
A correlation matrix is simply a table of the correlation coefficients between variables: each cell holds the correlation between one pair of variables. It is a compact way to summarize a large dataset, spot patterns, and serve as an input to (or diagnostic for) more advanced analyses.
I always use
zdf <- as.data.frame(as.table(z))
zdf
#   Var1 Var2      Freq
# 1    a    a  1.00000
# 2    b    a -0.99669
# 3    c    a -0.14063
# 4    d    a -0.28061
# 5    e    a  0.80519
Then use
subset(zdf, abs(Freq) > 0.5)
to keep only the strongly correlated pairs (0.5 is an arbitrary cutoff, not a significance test).
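To also drop the diagonal and mirrored duplicates and order by strength, as the question asks, one possible follow-up (a sketch building on zdf above) is:

zdf <- subset(zdf, as.integer(Var1) < as.integer(Var2))  # upper triangle only: drops a:a, keeps each pair once
zdf[order(-abs(zdf$Freq)), ]                             # strongest correlations first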
library(reshape)
z[z == 1] <- NA          # drop perfect correlations (the diagonal)
z[abs(z) < 0.5] <- NA    # drop less than abs(0.5)
z <- na.omit(melt(z))    # melt into a 3-column data frame
z[order(-abs(z$value)),] # sort
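Note that melting the full matrix keeps both halves of the symmetric matrix, so every pair appears twice (a:b and b:a). One way to avoid that, sketched here with the newer reshape2 package (which melts a matrix the same way) and a fresh copy of the correlation matrix, is to blank out the lower triangle first, as in the accepted update:

library(reshape2)                      # reshape2's melt() behaves the same way here
zz <- cor(df)                          # start again from the correlation matrix
zz[lower.tri(zz, diag = TRUE)] <- NA   # keep each pair once, drop the diagonal
zz[abs(zz) < 0.5] <- NA                # drop weak correlations
zm <- na.omit(melt(zz))                # columns: Var1, Var2, value
zm[order(-abs(zm$value)), ]            # strongest first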