The ggplot2 package is easily the best plotting system I have ever worked with, except that its performance does not scale well to larger datasets (~50k points). I'm looking into providing web analyses through Shiny, using ggplot2 as the plotting backend, but I'm not really happy with the performance, especially in contrast with base graphics. My question: are there any concrete ways to improve this performance?
The starting point is the following code example:
    library(ggplot2)

    n = 86400 # a day in seconds
    dat = data.frame(id = 1:n, val = sort(runif(n)))

    dev.new()

    gg_base  = ggplot(dat, aes(x = id, y = val))
    gg_point = gg_base + geom_point()
    gg_line  = gg_base + geom_line()
    gg_both  = gg_base + geom_point() + geom_line()

    benchplot(gg_point)
    benchplot(gg_line)
    benchplot(gg_both)

    system.time(plot(dat))
    system.time(plot(dat, type = 'l'))
I get the following timings on my MacBook Pro (Retina):
    > benchplot(gg_point)
           step user.self sys.self elapsed
    1 construct     0.000    0.000   0.000
    2     build     0.321    0.078   0.398
    3    render     0.271    0.088   0.359
    4      draw     2.013    0.018   2.218
    5     TOTAL     2.605    0.184   2.975
    > benchplot(gg_line)
           step user.self sys.self elapsed
    1 construct     0.000    0.000   0.000
    2     build     0.330    0.073   0.403
    3    render     0.622    0.095   0.717
    4      draw     2.078    0.009   2.266
    5     TOTAL     3.030    0.177   3.386
    > benchplot(gg_both)
           step user.self sys.self elapsed
    1 construct     0.000    0.000   0.000
    2     build     0.602    0.155   0.757
    3    render     0.866    0.186   1.051
    4      draw     4.020    0.030   4.238
    5     TOTAL     5.488    0.371   6.046
    > system.time(plot(dat))
       user  system elapsed
      1.133   0.004   1.138
    # Note that the timing below depended heavily on whether or not the graphics
    # device was in view. Having it out of view made performance much, much better.
    > system.time(plot(dat, type = 'l'))
       user  system elapsed
      1.230   0.003   1.233
Some more info on my setup:
    > sessionInfo()
    R version 2.15.3 (2013-03-01)
    Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

    locale:
    [1] C/UTF-8/C/C/C/C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] ggplot2_0.9.3.1

    loaded via a namespace (and not attached):
     [1] MASS_7.3-23        RColorBrewer_1.0-5 colorspace_1.2-1   dichromat_2.0-0
     [5] digest_0.6.3       grid_2.15.3        gtable_0.1.2       labeling_0.1
     [9] munsell_0.4        plyr_1.8           proto_0.3-10       reshape2_1.2.2
    [13] scales_0.2.3       stringr_0.6.2
Hadley gave a cool talk about his new packages dplyr and ggvis at useR! 2013, but he can probably tell you more about that himself.
I'm not sure what your application design looks like, but I often do in-database pre-processing before feeding the data to R. For example, if you are plotting time series, there is really no need to show every second of the day on the X axis. Instead you might want to aggregate and get the min/max/mean over e.g. one or five minute time intervals.
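To illustrate, here is a minimal sketch of the same idea done in plain R rather than SQL, reusing dat from the example above; the five-minute bin width is just an illustrative choice:

    # Bin the 86,400 one-second observations into 5-minute intervals and keep
    # only the min/max/mean per bin, so ggplot2 draws 288 rows instead of 86,400.
    dat$bin <- (dat$id - 1) %/% 300  # 300 seconds = 5 minutes
    agg <- do.call(rbind, lapply(split(dat, dat$bin), function(d) {
      data.frame(id   = d$id[1],
                 ymin = min(d$val),
                 ymax = max(d$val),
                 yavg = mean(d$val))
    }))

    # The ribbon shows the min/max band, the line the mean.
    ggplot(agg, aes(x = id)) +
      geom_ribbon(aes(ymin = ymin, ymax = ymax), alpha = 0.3) +
      geom_line(aes(y = yavg))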
Below is an example of a function I wrote years ago that does something like that in SQL. This particular example uses the modulo operator because times were stored as epoch milliseconds. But if the data in SQL are properly stored as date/datetime structures, SQL has more elegant native methods for aggregating by time period.
    #' @param table name of the table
    #' @param start start time/date
    #' @param end end time/date
    #' @param aggregate one of "days", "hours", "mins" or "weeks"
    #' @param group grouping variable
    #' @param column name of the target column (y axis)
    #' @export
    minmaxdata <- function(table, start, end, aggregate=c("days", "hours", "mins", "weeks"), group=1, column){

      #dates (convert to epoch milliseconds)
      start <- round(unclass(as.POSIXct(start))*1000);
      end <- round(unclass(as.POSIXct(end))*1000);

      #must aggregate
      aggregate <- match.arg(aggregate);

      #calculate modulus
      mod <- switch(aggregate,
        "mins"  = 1000*60,
        "hours" = 1000*60*60,
        "days"  = 1000*60*60*24,
        "weeks" = 1000*60*60*24*7,
        stop("invalid aggregate value")
      );

      #we need to add the time difference between GMT and PST to make modulo work
      delta <- 1000 * 60 * 60 * (24 - unclass(as.POSIXct(format(Sys.time(), tz="GMT")) - Sys.time()));

      #form query
      query <- paste("SELECT", group, "AS grouping, AVG(", column, ") AS yavg,",
        "MAX(", column, ") AS ymax, MIN(", column, ") AS ymin,",
        "((CMilliseconds_g +", delta, ") DIV", mod, ") AS timediv FROM", table,
        "WHERE CMilliseconds_g BETWEEN", start, "AND", end,
        "GROUP BY", group, ", timediv;")

      #getquery() is a helper (not shown here) that sends the query
      #to the database and returns the result as a data frame
      mydata <- getquery(query);

      #convert the integer time division back to a POSIXct timestamp
      mydata$time <- structure(mod*mydata[["timediv"]]/1000 - delta/1000, class=c("POSIXct", "POSIXt"));
      mydata$grouping <- as.factor(mydata$grouping)

      #round timestamps
      if(aggregate %in% c("mins", "hours")){
        mydata$time <- round(mydata$time, aggregate)
      } else {
        mydata$time <- as.Date(mydata$time);
      }

      #return
      return(mydata)
    }
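For completeness, hypothetical usage could look like the following; the measurements table, the temperature column, and the getquery() helper are assumptions for illustration, not part of the function above:

    # Hypothetical call -- table and column names are made up for illustration.
    mydata <- minmaxdata("measurements", "2013-06-01", "2013-06-08",
                         aggregate = "hours", column = "temperature")

    # Plot the pre-aggregated band instead of the raw points.
    ggplot(mydata, aes(x = time)) +
      geom_ribbon(aes(ymin = ymin, ymax = ymax), alpha = 0.3) +
      geom_line(aes(y = yavg))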