Boosting ggplot2 performance

The ggplot2 package is easily the best plotting system I've ever worked with, except that its performance is not really good for larger datasets (~50k points). I'm looking into providing web analyses through Shiny, using ggplot2 as the plotting backend, but I'm not really happy with the performance, especially in contrast with base graphics. My question is whether there are any concrete ways to increase this performance.

The starting point is the following code example:

library(ggplot2)

n = 86400 # a day in seconds
dat = data.frame(id = 1:n, val = sort(runif(n)))

dev.new()

gg_base = ggplot(dat, aes(x = id, y = val))
gg_point = gg_base + geom_point()
gg_line = gg_base + geom_line()
gg_both = gg_base + geom_point() + geom_line()

benchplot(gg_point)
benchplot(gg_line)
benchplot(gg_both)
system.time(plot(dat))
system.time(plot(dat, type = 'l'))

I get the following timings on my MacBook Pro Retina:

> benchplot(gg_point)
       step user.self sys.self elapsed
1 construct     0.000    0.000   0.000
2     build     0.321    0.078   0.398
3    render     0.271    0.088   0.359
4      draw     2.013    0.018   2.218
5     TOTAL     2.605    0.184   2.975
> benchplot(gg_line)
       step user.self sys.self elapsed
1 construct     0.000    0.000   0.000
2     build     0.330    0.073   0.403
3    render     0.622    0.095   0.717
4      draw     2.078    0.009   2.266
5     TOTAL     3.030    0.177   3.386
> benchplot(gg_both)
       step user.self sys.self elapsed
1 construct     0.000    0.000   0.000
2     build     0.602    0.155   0.757
3    render     0.866    0.186   1.051
4      draw     4.020    0.030   4.238
5     TOTAL     5.488    0.371   6.046
> system.time(plot(dat))
   user  system elapsed
  1.133   0.004   1.138
# Note that the timing below depended heavily on whether or not the graphics device
# was in view. Not in view made performance much, much better.
> system.time(plot(dat, type = 'l'))
   user  system elapsed
  1.230   0.003   1.233

Some more info on my setup:

> sessionInfo()
R version 2.15.3 (2013-03-01)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] C/UTF-8/C/C/C/C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ggplot2_0.9.3.1

loaded via a namespace (and not attached):
 [1] MASS_7.3-23        RColorBrewer_1.0-5 colorspace_1.2-1   dichromat_2.0-0
 [5] digest_0.6.3       grid_2.15.3        gtable_0.1.2       labeling_0.1
 [9] munsell_0.4        plyr_1.8           proto_0.3-10       reshape2_1.2.2
[13] scales_0.2.3       stringr_0.6.2
asked Aug 21 '13 by Paul Hiemstra


1 Answer

Hadley gave a cool talk about his new packages dplyr and ggvis at useR! 2013, but he can probably tell you more about that himself.

I'm not sure what your application design looks like, but I often do in-database pre-processing before feeding the data to R. For example, if you are plotting time series, there is really no need to show every second of the day on the X axis. Instead you might want to aggregate and get the min/max/mean over, e.g., one- or five-minute intervals.
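To give a feel for the effect, here is a minimal sketch of that kind of pre-aggregation done entirely in R, using the 86,400-point example from the question; the five-minute bin width and the column names (ymin, yavg, ymax) are just illustrative choices, not part of the original code:

library(ggplot2)

n <- 86400                              # one day in seconds
dat <- data.frame(id = 1:n, val = sort(runif(n)))

# assign each observation to a 5-minute (300-second) bin
dat$bin <- (dat$id - 1) %/% 300

# collapse to one row per bin: min, mean and max of 'val'
agg <- do.call(rbind, lapply(split(dat, dat$bin), function(d) {
  data.frame(id   = mean(d$id),
             ymin = min(d$val),
             yavg = mean(d$val),
             ymax = max(d$val))
}))

# 288 rows instead of 86,400, so ggplot2 has far less to draw
ggplot(agg, aes(x = id)) +
  geom_ribbon(aes(ymin = ymin, ymax = ymax), alpha = 0.3) +
  geom_line(aes(y = yavg))

The same reduction can of course be pushed into the database, which is what the function below does.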

Below is an example of a function I wrote years ago that does something like this in SQL. This particular example uses the modulo operator because times were stored as epoch milliseconds. But if the data are properly stored in SQL as date/datetime structures, SQL has more elegant native methods to aggregate by time period.

#' @param table name of the table
#' @param start start time/date
#' @param end end time/date
#' @param aggregate one of "days", "hours", "mins" or "weeks"
#' @param group grouping variable
#' @param column name of the target column (y axis)
#' @export
minmaxdata <- function(table, start, end, aggregate=c("days", "hours", "mins", "weeks"), group=1, column){

  #dates
  start <- round(unclass(as.POSIXct(start))*1000);
  end <- round(unclass(as.POSIXct(end))*1000);

  #must aggregate
  aggregate <- match.arg(aggregate);

  #calculate modulus
  mod <- switch(aggregate,
    "mins"   = 1000*60,
    "hours"  = 1000*60*60,
    "days"   = 1000*60*60*24,
    "weeks"  = 1000*60*60*24*7,
    stop("invalid aggregate value")
  );

  #we need to add the time difference between GMT and PST to make the modulo work
  delta <- 1000 * 60 * 60 * (24 - unclass(as.POSIXct(format(Sys.time(), tz="GMT")) - Sys.time()));

  #form query
  query <- paste("SELECT", group, "AS grouping, AVG(", column, ") AS yavg, MAX(", column, ") AS ymax, MIN(", column, ") AS ymin, ((CMilliseconds_g +", delta, ") DIV", mod, ") AS timediv FROM", table, "WHERE CMilliseconds_g BETWEEN", start, "AND", end, "GROUP BY", group, ", timediv;")
  mydata <- getquery(query);

  #data
  mydata$time <- structure(mod*mydata[["timediv"]]/1000 - delta/1000, class=c("POSIXct", "POSIXt"));
  mydata$grouping <- as.factor(mydata$grouping)

  #round timestamps
  if(aggregate %in% c("mins", "hours")){
    mydata$time <- round(mydata$time, aggregate)
  } else {
    mydata$time <- as.Date(mydata$time);
  }

  #return
  return(mydata)
}
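Note that the function relies on a getquery() helper (not shown here) that runs the SQL against whatever database connection you use, and on the epoch-millisecond column CMilliseconds_g existing in the table. Purely as an illustration, and assuming a hypothetical table "sensor_log" with a value column "temperature", the plotting side could then look roughly like this:

library(ggplot2)

# fetch pre-aggregated per-minute summaries instead of the raw rows
# (table name, column name and date range here are hypothetical)
mydata <- minmaxdata("sensor_log", "2013-08-20", "2013-08-21",
                     aggregate = "mins", column = "temperature")

# one ribbon + mean line per grouping level; far fewer points to draw
ggplot(mydata, aes(x = time, group = grouping)) +
  geom_ribbon(aes(ymin = ymin, ymax = ymax), alpha = 0.3) +
  geom_line(aes(y = yavg))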
answered Oct 15 '22 by Jeroen Ooms