visual structure of a data.frame: locations of NAs and much more

I want to represent the structure of a data frame (or matrix, data.table, or whatever) on a single plot with color-coding. I think this could be very useful for many people handling various types of data, letting them take it all in at a single glance.

Perhaps someone has already developed a package to do this, but I couldn't find one (just this). So here is a rough mockup of my "vision", a kind of heatmap showing, in color codes:

  • the NA locations,
  • the class of variables (factors (how many levels?), numeric (with color gradient, zeros, outliers...), strings)
  • dimensions
  • etc.

[image: rough mockup of the desired plot]

So far I have only written a function to plot the NA locations. It goes like this:

ggSTR <- function(data, alpha = 0.5) {
  library(ggplot2)
  DF <- data
  if (!is.matrix(data)) DF <- as.matrix(DF)

  # row and column index of every NA cell
  na.pos <- which(is.na(DF), arr.ind = TRUE)
  to.plot <- data.frame(x = unname(na.pos[, "col"]),
                        y = unname(na.pos[, "row"]))
  size <- 20 / log(prod(dim(DF)))  # point size depends on the size of the table
  g <- ggplot(data = to.plot) + aes(x, y) +
        geom_point(size = size, color = "red", alpha = alpha) +
        scale_y_reverse() + xlim(1, ncol(DF)) +  # row 1 at the top, like a table
        ggtitle("location of NAs in the data frame")

  pc <- round(sum(is.na(DF)) / prod(dim(DF)) * 100, 2)  # % NA
  print(paste("percentage of NA data: ", pc))

  return(g)
}

It takes any data.frame as input and returns this image:

[image: plot of the NA locations returned by the function]

It's too big a challenge for me to achieve the first image.

asked Dec 18 '14 at 11:12 by agenis


People also ask

How do you display the structure of a data frame?

To get the structure of a data frame, we use the built-in function str(). It must be given a data frame that has already been created, and it prints the dimensions along with each column's name, class, and first few values.
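As a small illustration, here is a sketch using the built-in airquality data set (an arbitrary choice for this example). The variable names col_classes and dims are only illustrative:

```r
# airquality ships with base R (datasets package)
str(airquality)

# The same information can also be collected programmatically
col_classes <- sapply(airquality, class)  # one class name per column
dims <- dim(airquality)                   # c(number of rows, number of columns)

print(col_classes)
print(dims)
```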

How do I check if a value is na in R?

To check which values are NA in an R data frame, we can use the apply function along with the is.na function, or call is.na on the data frame directly. This returns a logical matrix with TRUE where a value is missing and FALSE elsewhere.
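A minimal sketch of both approaches, again assuming the built-in airquality data set as example input (the variable names are illustrative):

```r
# Logical matrix: TRUE where a cell is NA
na_mask <- is.na(airquality)

# apply() over the columns yields a logical matrix with the same values
na_mask_apply <- apply(airquality, 2, is.na)

# Useful summaries derived from the mask
na_per_column <- colSums(na_mask)            # number of NAs in each column
na_cells <- which(na_mask, arr.ind = TRUE)   # row/column index of each NA

print(na_per_column)
print(head(na_cells))
```

The row/column indices in na_cells are the kind of coordinates a plot of NA locations needs.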


1 Answer

You can try the visdat package (https://github.com/ropensci/visdat), which shows the NA values and the data types in a single plot:

install.packages("visdat")
library(visdat)
vis_dat(airquality)
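
If only the missing values matter, the same package also provides vis_miss(). The following is a sketch, assuming visdat and ggplot2 are already installed. Both functions return ggplot objects, so ordinary ggplot2 layers can be added; the title is just an illustration:

```r
library(visdat)
library(ggplot2)

# Missingness only: one tile per cell, marked as missing or present
p_miss <- vis_miss(airquality)

# Column types plus NAs, with an extra ggplot2 layer added on top
p_dat <- vis_dat(airquality) + ggtitle("Structure of airquality")

print(p_miss)
print(p_dat)
```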

answered Nov 15 '22 at 20:11 by shaojl7