I want to generate graphs for pairs of variables (columns) whose correlation is above or below a certain threshold and whose p-value is < 0.01. The graphs would be ggplot2 (line or bar) graphs plotting the two correlated columns (variables).
Here is the gist of my approach so far, with some dummy data. I would love a pointer on where to go next.
# Create some dummy data
df <- data.frame(sample(1:50), sample(1:50), sample(1:50), sample(1:50))
colnames(df) <- c("var1", "var2", "var3", "var4")
# Find correlations in the dummy data
df.cor <- cor(df)
# Make up some random pvalues for this example
x <- 0:1000
df.cor.pvals <- data.frame(sample(x/1000, 4), sample(x/1000, 4), sample(x/1000, 4), sample(x/1000,4))
colnames(df.cor.pvals) <- c("var1", "var2", "var3", "var4")
# Find the significant correlations
df.cor.extreme <- ((df.cor < -0.01 | df.cor > 0.01) & df.cor.pvals < 0.5)
# Prepare data for plotting (melt() comes from the reshape2 package)
library(reshape2)
df$rownames <- rownames(df)
df.melt <- melt(df, id="rownames")
# I want to plot the combinations of variables that have a TRUE value
# in the df.cor.extreme matrix
Below is a hardcoded example for the case where var1 and var2 had a value of TRUE. I assume this is where I need some sort of loop to generate multiple plots, one for each pair varA and varB that is correlated.
ggplot(df.melt[(df.melt$variable=="var1" | df.melt$variable=="var2"),], aes(x=rownames, y=value, group=variable, colour=variable)) +
geom_line()
Yes, it is possible if you also keep the variable type in a column and you pick the appropriate correlation method based on the types.
Plot using heatmaps: there are many ways to plot correlation matrices, and one efficient way is a heatmap. A heatmap makes the correlations easy to read because it shows the correlation of each feature (variable) with every other feature (variable).
In this method, you call the cor() function and pass it the columns of interest, selected with a vector of column names, to get the correlations among multiple variables in R.
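As a minimal sketch of the call described above, assuming a data frame `df` with numeric columns named `var1` to `var4` (as in the dummy data from the question), selecting columns with a character vector and handing them to `cor()` could look like this. The names `cols` and `sub.cor` are only illustrative.

```r
# Dummy data; the seed is only here to make the example repeatable
set.seed(1)
df <- data.frame(var1 = sample(1:50), var2 = sample(1:50),
                 var3 = sample(1:50), var4 = sample(1:50))

# Pick the columns of interest with a character vector of names
cols <- c("var1", "var2", "var3")

# cor() returns a symmetric matrix with 1 on the diagonal
sub.cor <- cor(df[, cols])
sub.cor
```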
As said in the comment by @DrewSteen, the p-value matrix must have the same shape as the correlation matrix.
Here is a function that computes the p-value matrix (there is probably a built-in function for this, for example in the stats package).
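# Compute a symmetric matrix of p-values for every pair of columns in x.
# Extra arguments (...) are passed on to cor.test().
pvalue.matrix <- function(x,...){
  ncx <- ncol(x)
  r <- matrix(0, nrow = ncx, ncol = ncx)
  # Fill the lower triangle (including the diagonal)
  for (i in seq_len(ncx)) {
    for (j in seq_len(i)) {
      x2 <- x[, i]
      y2 <- x[, j]
      r[i, j] <- cor.test(x2,y2,...)$p.value
    }
  }
  # Mirror the lower triangle into the upper one without doubling the diagonal
  r <- r + t(r) - diag(diag(r))
  rownames(r) <- colnames(x)
  colnames(r) <- colnames(x)
  r
}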
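To make clear what ends up in each cell of that matrix, here is a small sketch of a single `cor.test()` call on two made-up vectors (`x` and `y` are illustrative, not part of the original data). It also shows the kind of extra argument, such as `method`, that the `...` in the function above would forward.

```r
# Two made-up numeric vectors, for illustration only
set.seed(42)
x <- rnorm(30)
y <- x + rnorm(30)

# Pearson correlation test (the default method)
res <- cor.test(x, y)
res$estimate   # the correlation coefficient
res$p.value    # the value stored in one cell of the p-value matrix

# The same pair tested with Spearman's rank correlation
res.s <- cor.test(x, y, method = "spearman")
res.s$p.value
```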
Then you use the vectorized versions of | and & like this:
df.cor.sig <- (df.cor > 0.01 | df.cor < -0.01) & pvalue.matrix(df) < 0.5
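The element-wise behaviour matters here: `|` and `&` compare matrices cell by cell, whereas `||` and `&&` are meant for single values. The sketch below uses made-up 3x3 matrices (the numbers and the names `r`, `p`, `sig1`, `sig2` are purely illustrative) and also shows that the two-sided threshold can be written with `abs()`.

```r
# Made-up correlation and p-value matrices, for illustration only
r <- matrix(c( 1.0, -0.4, 0.1,
              -0.4,  1.0, 0.5,
               0.1,  0.5, 1.0), nrow = 3, byrow = TRUE)
p <- matrix(c(0.00, 0.03, 0.60,
              0.03, 0.00, 0.20,
              0.60, 0.20, 0.00), nrow = 3, byrow = TRUE)

# Element-wise: each cell is tested separately
sig1 <- (r > 0.3 | r < -0.3) & p < 0.05

# The same test written with abs()
sig2 <- abs(r) > 0.3 & p < 0.05

sig1
identical(sig1, sig2)
```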
The plot is a classic heatmap with geom_tile:
library(reshape2) ## melt
library(plyr) ## round_any
library(ggplot2)
dat <- expand.grid(var1=1:4, var2=1:4)
dat$value <- melt(df.cor.sig)$value
dat$labels <- paste(round_any(df.cor,0.01) ,'(', round_any(pvalue.matrix(df),0.01),')',sep='')
ggplot(dat, aes(x=var1,y=var2,label=labels))+
geom_tile(aes(fill = value),colour='white')+
geom_text()
# Build one line plot per TRUE cell of df.cor.sig; FALSE cells get an empty grob
library(grid)       ## nullGrob
library(gridExtra)  ## grid.arrange
plots <- apply(dat,1,function(x){
  plot.grob <- nullGrob()
  if(length(grep(pattern='TRUE',x[3])) >0 ){
    gg <- paste('var',c(x[1],x[2]),sep='')
    p <- ggplot(subset(df.melt,variable %in% gg ),
                aes(x=rownames, y=value, group=variable, colour=variable)) +
      geom_line()
    plot.grob <- ggplotGrob(p)
  }
  plot.grob
})
do.call(grid.arrange, plots)
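The loop above visits every cell of the matrix, so the diagonal (a variable against itself) and both orderings of each pair are included. One possible refinement, sketched here in base R under the assumption that the significance matrix is symmetric and carries the variable names as dimnames, is to list only the unique off-diagonal pairs with `which(..., arr.ind = TRUE)`. The names `sig`, `idx` and `pairs` are illustrative and not from the original answer.

```r
# Illustrative logical matrix standing in for df.cor.sig
vars <- c("var1", "var2", "var3", "var4")
sig <- matrix(FALSE, 4, 4, dimnames = list(vars, vars))
diag(sig) <- TRUE
sig["var1", "var2"] <- sig["var2", "var1"] <- TRUE
sig["var3", "var4"] <- sig["var4", "var3"] <- TRUE

# Keep only cells above the diagonal so each pair appears once
idx <- which(sig & upper.tri(sig), arr.ind = TRUE)

# Translate row/column indices back to variable names
pairs <- data.frame(x = rownames(sig)[idx[, "row"]],
                    y = colnames(sig)[idx[, "col"]],
                    stringsAsFactors = FALSE)
pairs
```

Each row of `pairs` could then drive one plot, for example by subsetting the melted data with `variable %in% c(pairs$x[i], pairs$y[i])`.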