I've a data frame with 100+ columns. cor() returns remarkably quickly, but tells me far too much, especially as most columns are not correlated. I'd like it to just tell me column pairs and their correlation, ideally ordered.
In case that doesn't make sense, here is an artificial example:
df = data.frame(a=1:10, b=20:11*20:11, c=runif(10), d=runif(10), e=runif(10)*1:10)
z = cor(df)
z looks like this:
           a          b           c           d          e
a  1.0000000 -0.9966867 -0.38925240 -0.35142452  0.2594220
b -0.9966867  1.0000000  0.40266637  0.35896626 -0.2859906
c -0.3892524  0.4026664  1.00000000  0.03958307  0.1781210
d -0.3514245  0.3589663  0.03958307  1.00000000 -0.3901608
e  0.2594220 -0.2859906  0.17812098 -0.39016080  1.0000000
What I'm looking for is a function that will instead tell me:
a:b -0.9966867
b:c  0.4026664
d:e -0.39016080
a:c -0.3892524
b:d  0.3589663
a:d -0.3514245
b:e -0.2859906
a:e  0.2594220
c:e  0.17812098
c:d  0.03958307
I have a crude way to get rid of some of the noise:
z[abs(z)<0.5]=0
then scan looking for non-zero values. But it is far inferior to the desired output above.
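For what it's worth, that scan can also be done programmatically. Here is a minimal sketch, assuming z has already been thresholded as above (the 0.5 cutoff is the same arbitrary choice); the hits name is just illustrative:

# List the cells that survived the thresholding, taking each pair only once.
hits <- which(z != 0 & upper.tri(z), arr.ind = TRUE)   # row/column indices of non-zero cells
data.frame(pair = paste(rownames(z)[hits[, 1]], colnames(z)[hits[, 2]], sep = ":"),
           cor  = z[hits])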
UPDATE: Based on the answers received, and some trial and error, here is the solution I went with:
z[lower.tri(z,diag=TRUE)]=NA  # Prepare to drop duplicates and meaningless information
z=as.data.frame(as.table(z))  # Turn into a 3-column table
z=na.omit(z)                  # Get rid of the junk we flagged above
z=z[order(-abs(z$Freq)),]     # Sort by highest correlation (whether +ve or -ve)
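Wrapped up as a small helper, the same steps become reusable (a sketch only; the name flattenCorr is illustrative, not part of any package):

# Flatten a correlation matrix into a sorted pair/correlation table.
flattenCorr <- function(m) {
  m[lower.tri(m, diag = TRUE)] <- NA        # keep each pair once, drop the diagonal
  out <- na.omit(as.data.frame(as.table(m)))
  out[order(-abs(out$Freq)), ]              # strongest correlations first
}

flattenCorr(cor(df))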
A correlation matrix is simply a table of the correlation coefficients between variables: each cell holds the correlation between one pair of variables. It is a compact way to summarize a large dataset, spot patterns, and serve as an input to (or diagnostic for) more advanced analyses.
I always use
zdf <- as.data.frame(as.table(z))
zdf
#   Var1 Var2      Freq
# 1    a    a  1.00000
# 2    b    a -0.99669
# 3    c    a -0.14063
# 4    d    a -0.28061
# 5    e    a  0.80519
Then use
subset(zdf, abs(Freq) > 0.5)
to keep only the strongly correlated pairs (0.5 is an arbitrary cutoff, not a significance test).
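To also drop the diagonal and mirrored duplicates and order by strength, as the question asks, one possible follow-up (a sketch building on zdf above) is:

zdf <- subset(zdf, as.integer(Var1) < as.integer(Var2))  # upper triangle only: drops a:a, keeps each pair once
zdf[order(-abs(zdf$Freq)), ]                             # strongest correlations first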
library(reshape)
z[z == 1] <- NA          # drop perfect correlations (the diagonal)
z[abs(z) < 0.5] <- NA    # drop less than abs(0.5)
z <- na.omit(melt(z))    # melt into a 3-column data frame
z[order(-abs(z$value)),] # sort
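Note that melting the full matrix keeps both halves of the symmetric matrix, so every pair appears twice (a:b and b:a). One way to avoid that, sketched here with the newer reshape2 package (which melts a matrix the same way) and a fresh copy of the correlation matrix, is to blank out the lower triangle first, as in the accepted update:

library(reshape2)                      # reshape2's melt() behaves the same way here
zz <- cor(df)                          # start again from the correlation matrix
zz[lower.tri(zz, diag = TRUE)] <- NA   # keep each pair once, drop the diagonal
zz[abs(zz) < 0.5] <- NA                # drop weak correlations
zm <- na.omit(melt(zz))                # columns: Var1, Var2, value
zm[order(-abs(zm$value)), ]            # strongest first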