I have a large dataframe (400000 x 50) that I want to visually inspect for structure and blanks/gaps.
Is there an existing library or ggplot2 function that can produce a picture like this,
where red might be "Dates", blue "factors", green "characters", and black blanks/NAs?
Have you tried dfviewr() in lasagnar? The following reproduces the desired graphic for the 50-row x 10-column df.in in the package:
library(devtools)
install_github("swihart/lasagnar")
library(lasagnar)
dfviewr(df=df.in)
## also try:
##dfviewr(df=df.in, legend=FALSE)
##dfviewr(df=df.in, gridlines=FALSE)
To be fair, dfviewr() didn't exist at the time of the question. To see some of the ideas that led to its development, and how to actually visualize 400,000 rows, see the for-loop at the very bottom. Running the function directly on df2.in (400,000 x 50) is not advisable:
## Do not run:
## system.time(dfviewr(df=df2.in, gridlines=FALSE)) ## 10 minutes before useRaster=TRUE
## 2 minutes after
Also, tabplot::tableplot() doesn't seem to support dates or characters:
library(tabplot)
tableplot(df.in)
produces:
Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered, : vmode 'character' not implemented
and so we eliminate the character column (#9):
tableplot(df.in[,c(-9)])
which produces:
Error in UseMethod("as.hi") :
no applicable method for 'as.hi' applied to an object of class "c('POSIXct', 'POSIXt')"
so we eliminate the first column (Date) as well:
tableplot(df.in[,c(-1,-9)])
and finally get a plot.
For the 400,000 x 50 df2.in without the Date or character columns, the image rendering was quite quick (6 seconds):
system.time(tableplot(df2.in[,c(-(1+seq(0,40,10)), -(9+seq(0,40,10))) ]))
I present first a baby example on 50 rows, then an example on the 400,000 rows.
For what it's worth, I second the comment by @cmbarbu: viewing 400K rows on a single plot is limited by a screen that has at best 2K pixels in height, so breaking the plot apart across pages helps prevent overplotting. I include an attempt at this by making a PDF document of 1000 plots/pages with 400 rows each.
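The paging arithmetic can be sketched on its own. The helper below, page_ranges(), is a hypothetical name used only for illustration (it is not part of lasagnar); it assumes nothing beyond base R and also handles a row count that is not an exact multiple of the page size.

```r
# Hypothetical helper (not part of lasagnar): first and last row of each page.
page_ranges <- function(n_rows, rows_per_page = 400) {
  starts <- seq(1, n_rows, by = rows_per_page)
  # The last page may be shorter, so cap its end at n_rows.
  ends <- pmin(starts + rows_per_page - 1, n_rows)
  data.frame(page = seq_along(starts), start = starts, end = ends)
}

pages <- page_ranges(400000, 400)
nrow(pages)     # number of pages needed
head(pages, 2)  # row ranges of the first two pages
```

Each row of pages gives the slice mat.out[start:end, ] that would go on one page of the PDF.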
I do not know of a function that renders the requested plot with a data.frame as input. My approach is to make a matrix mask of the data.frame and then use lasagna() from the lasagnar package on GitHub. lasagna() is a wrapper for image( t(X)[, (nrow(X):1)] ), where X is a matrix. This call reorders the rows so that they match the order of the data.frame, and the wrapper adds the ability to toggle grid lines and add legends (legend=TRUE will invoke image.plot( t(X)[, (nrow(X):1)] ); however, in the example below I add a legend explicitly, without using image.plot()).
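To see why the transpose-and-reverse is needed, here is a minimal base-R sketch on a toy 3 x 2 matrix. It does not use lasagnar; it only illustrates the reordering described above, and the toy matrix and color choice are assumptions made for the example.

```r
# Toy matrix; its rows are (1,2), (3,4), (5,6).
X <- matrix(1:6, nrow = 3, byrow = TRUE)

# image() maps rows of its input to the x axis and columns to the y axis,
# with y increasing upward. Transposing and reversing therefore puts
# row 1 of X in the last column, which is drawn as the top strip.
Z <- t(X)[, nrow(X):1]

dim(Z)   # one row per column of X, one column per row of X
Z[, 3]   # row 1 of X, drawn at the top

image(Z, axes = FALSE, col = gray.colors(6))
```

Without the reversal, row 1 of the matrix would be drawn at the bottom of the picture, upside down relative to how the data.frame prints.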
library(fields)
library(colorspace)
library(lubridate)
library(devtools)
install_github("swihart/lasagnar")
library(lasagnar)
df.in <- data.frame(date=seq(ymd('2012-04-07'),ymd('2013-03-22'),
by = '1 week'),
col1=rnorm(50),
col2=rnorm(50),
col3=rnorm(50),
col4=rnorm(50),
col5=as.factor(c("A","B")),
col6=as.factor(c("MS","PHD")),
col7=rnorm(50),
col8=(c("cherlene","randy")),
col9=rnorm(50),
stringsAsFactors=FALSE)
df.in[19:23 , 2:4 ] <- NA
df.in[c(7, 9), ] <- NA
df.in[2:30 , 4 ] <- NA
df.in[10 , 7 ] <- NA
df.in[14 , 6:10 ] <- NA
str(df.in)
mat.out <- matrix(NA, nrow=nrow(df.in), ncol=ncol(df.in))
## red for dates (newer lubridate versions return Date, not POSIXct, from ymd())
mat.out[,sapply(df.in, function(x) is.POSIXct(x) || is.Date(x))] <- 1
## blue for factors
mat.out[,sapply(df.in,is.factor)] <- 2
## green for characters
mat.out[,sapply(df.in,is.character)] <- 3
## white for numeric
mat.out[,sapply(df.in,is.numeric)] <- 4
## black for NA
mat.out[is.na(df.in)] <- 5
row.names(mat.out) <- 1:nrow(df.in)
lasagna(mat.out, col=c("red","blue","green","white","black"),
cex=0.67, main="")
lasagna(mat.out, col=c("red","blue","green","white","black"),
cex=.67, main="")
legend("bottom", fill=c("red","blue","green","white","black"),
legend=c("dates", "factors", "characters", "numeric", "NA"),
horiz=T, xpd=NA, inset=c(-.15), border="black")
lasagna(mat.out, col=c("red","blue","green","white","black"),
cex=.67, main="", gridlines=FALSE)
legend("bottom", fill=c("red","blue","green","white","black"),
legend=c("dates", "factors", "characters", "numeric", "NA"),
horiz=T, xpd=NA, inset=c(-.15), border="black")
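The mask-building steps above are repeated almost verbatim for the larger data frame below, so they could be wrapped in a function. The following is only a sketch: type_mask() is a hypothetical name, not part of lasagnar, and it uses base R's inherits() to catch both Date and POSIXct columns instead of the lubridate predicates.

```r
# Hypothetical helper (not part of lasagnar): encode column types and NAs.
# Codes: 1 = date, 2 = factor, 3 = character, 4 = numeric, 5 = NA.
type_mask <- function(df) {
  code_for <- function(x) {
    if (inherits(x, c("Date", "POSIXt"))) 1L
    else if (is.factor(x)) 2L
    else if (is.character(x)) 3L
    else if (is.numeric(x)) 4L
    else NA_integer_   # any other column type is left uncoded
  }
  codes <- vapply(df, code_for, integer(1))
  # byrow = TRUE repeats the per-column codes down every row.
  mask <- matrix(codes, nrow = nrow(df), ncol = ncol(df), byrow = TRUE,
                 dimnames = list(seq_len(nrow(df)), names(df)))
  mask[is.na(df)] <- 5L   # missing cells override the type code
  mask
}

toy <- data.frame(d = as.Date("2020-01-01") + 0:2,
                  f = factor(c("a", "b", NA)),
                  s = c("x", NA, "z"),
                  n = c(1, 2, NA),
                  stringsAsFactors = FALSE)
type_mask(toy)
```

The result can be passed to lasagna() with the same five-color vector used above.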
df2.10 <- data.frame(date=seq(ymd('2012-04-07'),ymd('2013-03-22'),
by = '1 week'),
col1=rnorm(400000),
col2=rnorm(400000),
col3=rnorm(400000),
col4=rnorm(400000),
col5=as.factor(c("A","B")),
col6=as.factor(c("MS","PHD")),
col7=rnorm(400000),
col8=(c("cherlene","randy")),
col9=rnorm(400000),
stringsAsFactors=FALSE)
df2.10[c(19:23), c(2:4) ] <- NA
df2.10[c(7, 9), ] <- NA
df2.10[c(2:30), 4 ] <- NA
df2.10[10 , 7 ] <- NA
df2.10[14 , c(6:10) ] <- NA
df2.10[c(450:750), ] <- NA
df2.10[c(399990:399999), ] <- NA
df2.in <- cbind(df2.10, df2.10, df2.10, df2.10, df2.10)
str(df2.in)
mat.out <- matrix(NA, nrow=nrow(df2.in), ncol=ncol(df2.in))
## red for dates (newer lubridate versions return Date, not POSIXct, from ymd())
mat.out[,sapply(df2.in, function(x) is.POSIXct(x) || is.Date(x))] <- 1
## blue for factors
mat.out[,sapply(df2.in,is.factor)] <- 2
## green for characters
mat.out[,sapply(df2.in,is.character)] <- 3
## white for numeric
mat.out[,sapply(df2.in,is.numeric)] <- 4
## black for NA
mat.out[is.na(df2.in)] <- 5
row.names(mat.out) <- 1:nrow(df2.in)
pdf("pages1000.pdf")
system.time(
for(i in 1:1000){
lasagna_plain(mat.out[((i-1)*400+1):(400*i),],
col=c("red","blue","green","white","black"), cex=1,
main=paste0("rows: ", (i-1)*400+1, " - ", (400*i)))
}
)
dev.off()
The for-loop completed in about 40 seconds on my machine, and the PDF was written very shortly thereafter. Now standardize the page size in the PDF viewer and page down, viewing pages/plots such as these:
Give this a shot.
require(Amelia)
data(freetrade)
missmap(freetrade)
It won't color cells red, blue, and green by type, but it gets you the grid of missing values. I'd also give the VIM package a shot, as it provides numerous options for visualizing missing data.
http://www.statistik.tuwien.ac.at/forschung/CS/CS-2008-1complete.pdf