Correlation Matrix - tidyr gather v. reshape2 melt

I would like to use ggplot2 to make an upper triangle correlation matrix like this one. I can replicate that one just fine, but for some reason I'm stuck on really wanting to convert the reshape2 functions to tidyr ones. I would think that I could use gather in place of melt, but that is not working.

Original Results using reshape2

library(reshape2)
library(ggplot2)
mydata <- mtcars[, c(1, 3, 4, 5, 6, 7)]
cormat <- round(cor(mydata), 2)
melted_cormat <- melt(cormat)

# Get upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
    cormat[lower.tri(cormat)]<- NA
    return(cormat)
}

upper_tri <- get_upper_tri(cormat)

melted_cormat <- melt(upper_tri, na.rm = TRUE)

ggplot(data = melted_cormat, aes(Var2, Var1, fill = value)) + 
    geom_tile()

My attempt at this using gather from tidyr.

library(tidyverse)


# first, the correlation matrix
cor_base <- round(cor(mydata), 2)
# now keep only the upper triangle
cor_base[lower.tri(cor_base)] <- NA
cor_tri <- as.data.frame(cor_base) %>% 
    rownames_to_column("Var2") %>% 
    gather(key = Var1, value = value, -Var2, na.rm = TRUE) %>% 
    as.data.frame()

ggplot(data = cor_tri, aes(x = Var2, y = Var1, fill = value)) + 
    geom_tile()

The values are all the same, but some change in order occurred that is making this look wrong. A check of identical doesn't return TRUE but the values of the two data frames seem to be the same...

> identical(cor_tri, melted_cormat)
[1] FALSE
> dim(cor_tri)
[1] 21  3
> dim(melted_cormat)
[1] 21  3
> sum(cor_tri == melted_cormat)
[1] 63

Any thoughts on this or should I just go ahead and load reshape2 to accomplish what I'm going for?

Thanks.

Nick Criswell asked Nov 24 '17 15:11

1 Answer

Essentially, the difference is the type of Var1 and Var2: factor in the reshape2 version, character in the tidyr version. reshape2's melt() returns factors whose levels keep the order of the correlation matrix: "mpg", "disp", "hp", "drat", "wt", "qsec". tibble::rownames_to_column() and gather() create character columns instead, which ggplot2 places on the axes in alphabetical order: "disp", "drat", "hp", "mpg", "qsec", "wt". The two orderings differ, and that changes how the plot is rendered.
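
A minimal base R sketch of that difference, using only the variable names (the object names vars and as_factor_ordered are illustrative, not from the question):

```r
# Variable names in the order they appear in the correlation matrix
vars <- c("mpg", "disp", "hp", "drat", "wt", "qsec")

# A factor with explicit levels keeps the matrix order
as_factor_ordered <- factor(vars, levels = vars)
levels(as_factor_ordered)

# A character vector has no levels; when it is converted to a factor
# without explicit levels, the levels are sorted instead
levels(factor(vars))

# Same labels but different types, so identical() reports FALSE
identical(as.character(as_factor_ordered), vars)  # TRUE
identical(as_factor_ordered, vars)                # FALSE
```

This is also why identical(cor_tri, melted_cormat) can be FALSE even though every cell compares equal.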

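To resolve this, add a dplyr::mutate() step that builds Var1 with base::factor(row.names(.), ...) and explicitly sets the levels to the original order of cor_base's row names. Setting factor_key = TRUE in gather() does the same for Var2 by keeping the original column order. Also, your Var1 and Var2 were reversed.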

cor_base <- round(cor(mydata), 2)
cor_base[lower.tri(cor_base)] <- NA

cor_tri <- as.data.frame(cor_base) %>% 
  mutate(Var1 = factor(row.names(.), levels=row.names(.))) %>% 
  gather(key = Var2, value = value, -Var1, na.rm = TRUE, factor_key = TRUE) 

ggplot(data = cor_tri, aes(Var2, Var1, fill = value)) + 
  geom_tile()

Cor Matrix Plot Output
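
For readers on tidyr 1.0.0 or later, where gather() is superseded by pivot_longer(), the same idea might be sketched as follows. This variant is an assumption on my part rather than something tested against the original plot: var_order and cor_tri_long are illustrative names, and the factor levels are set after reshaping instead of through factor_key.

```r
library(dplyr)
library(tidyr)
library(tibble)

mydata <- mtcars[, c(1, 3, 4, 5, 6, 7)]
cor_base <- round(cor(mydata), 2)
cor_base[lower.tri(cor_base)] <- NA

# Remember the matrix order so both axes can reuse it as factor levels
var_order <- rownames(cor_base)

cor_tri_long <- as.data.frame(cor_base) %>%
  rownames_to_column("Var1") %>%
  # Long format: one row per (Var1, Var2) pair, dropping the NA lower triangle
  pivot_longer(cols = -Var1, names_to = "Var2", values_to = "value",
               values_drop_na = TRUE) %>%
  # Convert both character columns to factors in the matrix order
  mutate(Var1 = factor(Var1, levels = var_order),
         Var2 = factor(Var2, levels = var_order))
```

The result should work with the same ggplot(data = ..., aes(Var2, Var1, fill = value)) + geom_tile() call.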


Also, for you or future readers, here is a base R version using reshape() (from the stats package) that likewise resolves the factor level issue above:

cor_base <- round(cor(mydata), 2)
cor_base[lower.tri(cor_base)] <- NA

cor_base_df <- transform(as.data.frame(cor_base),
                         Var1 = factor(row.names(cor_base), levels=row.names(cor_base)))

cor_long <- subset(reshape(cor_base_df, idvar=c("Var1"), 
                           varying = c(1:(ncol(cor_base_df)-1)), v.names="value",
                           timevar = "Var2", 
                           times = factor(row.names(cor_base), levels=row.names(cor_base)),
                           new.row.names = 1:100,
                           direction = "long"), !is.na(value))

ggplot(data = cor_long, aes(Var2, Var1, fill = value)) + 
  geom_tile()
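
A shorter base R route, offered only as a possible sketch, relies on as.data.frame() applied to a table: it returns Var1 and Var2 as factors whose levels follow the matrix's dimnames order, so no explicit levels are needed. The name cor_tab is illustrative.

```r
mydata <- mtcars[, c(1, 3, 4, 5, 6, 7)]
cor_base <- round(cor(mydata), 2)
cor_base[lower.tri(cor_base)] <- NA

# as.table() keeps the dimnames; as.data.frame() then yields one row per cell
# with factor columns Var1 (rows) and Var2 (columns)
cor_tab <- as.data.frame(as.table(cor_base), responseName = "value")

# Drop the lower triangle, which was set to NA above
cor_tab <- cor_tab[!is.na(cor_tab$value), ]
```

cor_tab can then be passed to the same ggplot() call as cor_long.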
Parfait answered Oct 14 '22 20:10
