
Inspecting and visualizing gaps/blanks and structure in large dataframes

I have a large dataframe (400000 x 50) that I want to visually inspect for structure and blanks/gaps.

Is there an existing library or ggplot2 function that can produce a picture like this:

(image: desired output)

Where red might be "Dates", blue for "factors", green for "characters", and black for blanks/NAs.

emehex asked Mar 02 '15 15:03


2 Answers

Have you tried dfviewr in lasagnar? The following reproduces the desired graphic for df.in, the 50-row x 10-column data frame included in the package:

library(devtools)
install_github("swihart/lasagnar")
library(lasagnar)   
dfviewr(df=df.in)
## also try:
##dfviewr(df=df.in, legend=FALSE)
##dfviewr(df=df.in, gridlines=FALSE)

To be fair, dfviewr didn't exist at the time of the question. To see some of the ideas that led to its development, and how to actually visualize 400,000 rows, see the for-loop at the very bottom. Don't be so foolhardy as to run the function on df2.in (400,000 x 50):

## Do not run:
## system.time(dfviewr(df=df2.in, gridlines=FALSE))
## 10 minutes before useRaster=TRUE; 2 minutes after

Also, tabplot::tableplot() doesn't seem to support dates or characters:

library(tabplot)
tableplot(df.in)

produces:

Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered, : vmode 'character' not implemented

and so we eliminate the character column (#9):

tableplot(df.in[,c(-9)])

which produces:

Error in UseMethod("as.hi") : no applicable method for 'as.hi' applied to an object of class "c('POSIXct', 'POSIXt')"

so we eliminate the first column (Date) as well:

tableplot(df.in[,c(-1,-9)])

and the plot finally renders.

And for the 400,000 by 50 df2.in without the Date or character columns, the image rendering was quite quick (6 seconds):

system.time(tableplot(df2.in[,c(-(1+seq(0,40,10)), -(9+seq(0,40,10))) ]))
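The index expressions above expand to columns 1, 11, ..., 41 (the dates) and 9, 19, ..., 49 (the characters). Here is a minimal sketch, assuming you would rather drop those columns by class than by position. The names `toy`, `unsupported` and `toy_ok` are illustrative and not part of tabplot.

```r
# Toy data frame with the same mix of column types as df.in
toy <- data.frame(
  date = as.POSIXct(c("2012-04-07", "2012-04-14"), tz = "UTC"),
  num  = c(1.5, NA),
  fac  = factor(c("A", "B")),
  chr  = c("cherlene", "randy"),
  stringsAsFactors = FALSE
)

# TRUE for the column types that triggered the errors above
# (character columns and date / date-time columns)
unsupported <- vapply(
  toy,
  function(x) is.character(x) || inherits(x, c("Date", "POSIXt")),
  logical(1)
)

# Keep only the remaining columns; drop = FALSE keeps a data.frame
toy_ok <- toy[, !unsupported, drop = FALSE]
names(toy_ok)
```

The same logical vector could be computed on df2.in and used in place of the hard-coded negative indices.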

For the interested reader...

I present first a baby example on 50 rows, then an example on the 400,000 rows.

For what it's worth, I second the comment by @cmbarbu: viewing 400K rows on a single plot is limited by a screen that has at best 2K pixels in height, so breaking the plot apart across pages might help prevent overplotting. I include an attempt at this by making a PDF document of 1000 plots/pages with 400 rows each.
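As an aside, one alternative to paging (not what this answer does) is to aggregate before plotting: collapse each block of rows into the fraction of missing cells per column, so that each pixel row summarizes many data rows. This is a rough sketch in base R on made-up data; `is_missing`, `block_id` and `na_frac` are illustrative names, and it assumes the row count is a multiple of the block size.

```r
n_rows <- 4000   # stand-in for 400,000
block  <- 400    # rows summarized per plotted row

# Logical matrix marking missing cells (made-up pattern)
is_missing <- matrix(FALSE, nrow = n_rows, ncol = 3)
is_missing[450:750, ] <- TRUE
is_missing[1:30, 2]   <- TRUE

# Assign each row to a block, then sum missing cells per block and column
block_id <- ceiling(seq_len(n_rows) / block)
na_frac  <- rowsum(is_missing * 1, group = block_id) / block

# Darker cells mean a larger share of missing values in that block
image(t(na_frac)[, nrow(na_frac):1],
      col = gray(seq(1, 0, length.out = 16)), zlim = c(0, 1), axes = FALSE)
```

The trade-off is that the column types and the exact positions of individual gaps are no longer visible, which the paged PDF preserves.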

I do not know of a function that renders the requested plot directly from a data.frame. My approach is to build a matrix mask of the data.frame and then use lasagna() from the lasagnar package on GitHub. lasagna() is a wrapper for image( t(X)[, (nrow(X):1)] ), where X is a matrix. This call reorders the rows so that they match the order of the data.frame, and the wrapper adds the ability to toggle grid lines and add legends. (legend=TRUE will invoke image.plot( t(X)[, (nrow(X):1)] ); in the example below, however, I add a legend explicitly rather than using image.plot().)
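To see why the transpose and row reversal are needed, here is a small base R sketch that uses only image(), without lasagnar. The matrix `X` is made up so that each row holds its own row number.

```r
# Each row of X holds its own row index, so orientation is easy to check
X <- matrix(rep(1:3, times = 4), nrow = 3, ncol = 4)

# image() draws z[i, j] at x-position i and y-position j, with y increasing
# upward. Transposing puts the columns of X along the x-axis; reversing the
# row order puts row 1 of X at the top of the plot, as in a printed data.frame.
Z <- t(X)[, nrow(X):1]

# Row 1 (red) is drawn at the top, row 3 (green) at the bottom
image(Z, col = c("red", "blue", "green"), axes = FALSE)
```

Calling image(X) directly would instead draw the rows of X along the x-axis.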

libraries for the task

library(fields)
library(colorspace)  
library(lubridate)
library(devtools)
install_github("swihart/lasagnar")
library(lasagnar)   

create a sample dataframe of 50 rows (baby example before 400K example)

df.in <- data.frame(date=seq(ymd('2012-04-07'),ymd('2013-03-22'), 
                    by = '1 week'),
           col1=rnorm(50),
           col2=rnorm(50),
           col3=rnorm(50),
           col4=rnorm(50),
           col5=as.factor(c("A","B")),
           col6=as.factor(c("MS","PHD")),
           col7=rnorm(50),
           col8=(c("cherlene","randy")),
           col9=rnorm(50),
           stringsAsFactors=FALSE)

induce missingness

df.in[19:23  , 2:4  ] <- NA
df.in[c(7, 9),      ] <- NA
df.in[2:30   , 4    ] <- NA
df.in[10     , 7    ] <- NA
df.in[14     , 6:10 ] <- NA

check structure

str(df.in)

prep the mask matrix

mat.out <- matrix(NA, nrow=nrow(df.in), ncol=ncol(df.in))

then cycle through columns for types; apply is.na() at the end

## red for dates (ymd() returns POSIXct or Date, depending on the lubridate version)
mat.out[,sapply(df.in, function(x) is.POSIXct(x) || is.Date(x))] <- 1
## blue for factors
mat.out[,sapply(df.in,is.factor)] <- 2
## green for characters
mat.out[,sapply(df.in,is.character)] <- 3
## white for numeric
mat.out[,sapply(df.in,is.numeric)] <- 4
## black for NA
mat.out[is.na(df.in)] <- 5
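The same steps can be collected into a function so they need not be repeated for each data frame. This is a minimal sketch in base R; `type_mask` is a hypothetical helper, not part of lasagnar. It treats both Date and POSIXct/POSIXlt columns as dates, on the assumption that the date column's class may differ between lubridate versions.

```r
# Hypothetical helper: integer matrix coding each cell by column type
# 1 = date, 2 = factor, 3 = character, 4 = numeric, 5 = NA
type_mask <- function(df) {
  mask <- matrix(NA_integer_, nrow = nrow(df), ncol = ncol(df),
                 dimnames = list(NULL, names(df)))
  mask[, vapply(df, inherits, logical(1), what = c("Date", "POSIXt"))] <- 1L
  mask[, vapply(df, is.factor, logical(1))]    <- 2L
  mask[, vapply(df, is.character, logical(1))] <- 3L
  mask[, vapply(df, is.numeric, logical(1))]   <- 4L
  # Missing cells override the type code, so this step comes last
  mask[is.na(df)] <- 5L
  mask
}

toy <- data.frame(
  date = as.Date(c("2012-04-07", NA, "2012-04-21")),
  num  = c(0.1, 0.2, NA),
  fac  = factor(c("A", "B", "A")),
  chr  = c("cherlene", NA, "randy"),
  stringsAsFactors = FALSE
)
m <- type_mask(toy)
m
```

Columns of any other class (logical, list) would stay NA in the mask and would need their own code.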

row names might be nice for tracing back to the original data

row.names(mat.out) <- 1:nrow(df.in)

render { lasagna(X) is a wrapper for image( t(X)[, (nrow(X):1)] ) }

lasagna(mat.out, col=c("red","blue","green","white","black"), 
        cex=0.67, main="")

legends are possible:

lasagna(mat.out, col=c("red","blue","green","white","black"), 
        cex=.67, main="")
legend("bottom", fill=c("red","blue","green","white","black"),
        legend=c("dates", "factors", "characters", "numeric", "NA"), 
        horiz=TRUE, xpd=NA, inset=c(-.15), border="black")

turn gridlines off with gridlines=FALSE

lasagna(mat.out, col=c("red","blue","green","white","black"), 
        cex=.67, main="", gridlines=FALSE)
legend("bottom", fill=c("red","blue","green","white","black"),
        legend=c("dates", "factors", "characters", "numeric", "NA"), 
        horiz=TRUE, xpd=NA, inset=c(-.15), border="black")

Now an example at the size of the OP's data: 400,000 rows x 50 cols

create a sample dataframe

df2.10 <- data.frame(date=seq(ymd('2012-04-07'),ymd('2013-03-22'), 
                    by = '1 week'),
           col1=rnorm(400000),
           col2=rnorm(400000),
           col3=rnorm(400000),
           col4=rnorm(400000),
           col5=as.factor(c("A","B")),
           col6=as.factor(c("MS","PHD")),
           col7=rnorm(400000),
           col8=(c("cherlene","randy")),
           col9=rnorm(400000),
           stringsAsFactors=FALSE)

induce missingness

df2.10[c(19:23), c(2:4)  ] <- NA
df2.10[c(7,  9),         ] <- NA
df2.10[c(2:30), 4        ] <- NA
df2.10[10     , 7        ] <- NA
df2.10[14     , c(6:10)  ] <- NA    
df2.10[c(450:750), ] <- NA
df2.10[c(399990:399999), ] <- NA

cbind into 50 column wide df; check structure

df2.in <- cbind(df2.10, df2.10, df2.10, df2.10, df2.10)
str(df2.in)

prep the mask matrix

mat.out <- matrix(NA, nrow=nrow(df2.in), ncol=ncol(df2.in))

then cycle through columns for types; apply is.na() at the end

## red for dates (ymd() returns POSIXct or Date, depending on the lubridate version)
mat.out[,sapply(df2.in, function(x) is.POSIXct(x) || is.Date(x))] <- 1
## blue for factors
mat.out[,sapply(df2.in,is.factor)] <- 2
## green for characters
mat.out[,sapply(df2.in,is.character)] <- 3
## white for numeric
mat.out[,sapply(df2.in,is.numeric)] <- 4
## black for NA
mat.out[is.na(df2.in)] <- 5

row names might be nice for tracing back to the original data

row.names(mat.out) <- 1:nrow(df2.in)

render { lasagna_plain(X) has no gridlines or rownames }

pdf("pages1000.pdf")
  system.time(
    for(i in 1:1000){
        lasagna_plain(mat.out[((i-1)*400+1):(400*i),],
                      col=c("red","blue","green","white","black"), cex=1, 
                      main=paste0("rows: ", (i-1)*400+1,  " - ",  (400*i)))
    }
  )
dev.off()
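The loop above relies on 400,000 being an exact multiple of 400. Here is a small sketch, in base R, of how the page boundaries could be computed when the row count does not divide evenly; `page_ranges` is a hypothetical helper.

```r
# Hypothetical helper: start/end row indices for pages of at most page_size rows
page_ranges <- function(n_rows, page_size = 400) {
  starts <- seq(1, n_rows, by = page_size)
  # The last page is clipped to the final row
  ends <- pmin(starts + page_size - 1, n_rows)
  data.frame(start = starts, end = ends)
}

pages <- page_ranges(1050, page_size = 400)
pages
```

Inside the loop, the slice would then be mat.out[pages$start[i]:pages$end[i], , drop = FALSE], with i running over seq_len(nrow(pages)).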

The for-loop completed in 40 seconds on my machine, and the PDF was written very shortly thereafter. Now standardize the page size in the PDF viewer and page down through the plots.

swihart answered Oct 13 '22 00:10


Give this a shot.

library(Amelia)
data(freetrade)
missmap(freetrade)

It won't do the red, blue and green coloring, but it gets you the grid. I'd also give the VIM package a shot, as it provides numerous options for visualizing missing data.

http://www.statistik.tuwien.ac.at/forschung/CS/CS-2008-1complete.pdf
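Before plotting, a per-column count of missing values gives a quick numeric summary of where the gaps are. This is a minimal base R sketch that uses the built-in airquality data set rather than freetrade, so it runs without Amelia installed.

```r
# Number and share of missing values in each column
na_counts <- colSums(is.na(airquality))
na_share  <- colMeans(is.na(airquality))

# Columns with at least one missing value, most affected first
sort(na_counts[na_counts > 0], decreasing = TRUE)
```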

rwdvc answered Oct 13 '22 01:10