Looping in R to create many plots when you have one extra variable

Tags: loops, r, ggplot2

I am often faced with data that have too many categorical variables to plot satisfactorily on a single plot. When this situation arises, I write something that loops over one of the variables and saves a separate plot for each of its values.

This process is illustrated by the following example:

library(tidyr)
library(dplyr)
library(ggplot2)

# promote the row names to a proper "car" column so it can be facetted on
# (add_rownames() was later deprecated in favour of tibble::rownames_to_column())
mtcars <- add_rownames(mtcars, "car")

param <- unique(mtcars$cyl)
for (i in param) {
  mcplt <- mtcars %>%
    filter(cyl == i) %>%
    ggplot(aes(x = mpg, y = hp)) +
    geom_point() +
    facet_wrap(~car) +
    ggtitle(paste("Cylinder Type: ", i, sep = ""))
  ggsave(mcplt, file = paste("Type", i, ".jpeg", sep = ""))
}

Whenever I see references to looping in R, though, everyone seems to indicate that looping is usually not a good strategy. If that is the case, can anyone recommend a better way of achieving the same result as above? I'd be particularly interested in something faster, since loops are so slow. But maybe this is already the best solution; I was just curious whether anyone could improve on it.

Thanks in advance.

1 Answer

This is a well-covered topic for R; see SO posts here and here. Answers to those questions point out that *apply() alternatives to for() improve clarity, make parallelization easier, and in some circumstances speed things up; a sketch of an lapply() version of your loop follows the list below. However, presumably your real question is "how do I make this faster", because it is taking long enough that you're unhappy. Inside your loop you are doing three distinct tasks:

  1. Break out a chunk of the data frame using filter().
  2. Make a plot.
  3. Save the plot to a jpeg.
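
To make the comparison concrete, here is a minimal sketch of your loop rewritten with lapply(); it assumes mtcars already has the "car" column from add_rownames() as in the question, and it writes the same files as the for() version:

library(dplyr)
library(ggplot2)

# one function call per cylinder count, instead of one loop iteration
save_cyl <- function(i) {
  mcplt <- mtcars %>%
    filter(cyl == i) %>%
    ggplot(aes(x = mpg, y = hp)) +
    geom_point() +
    facet_wrap(~car) +
    ggtitle(paste("Cylinder Type: ", i, sep = ""))
  ggsave(paste("Type", i, ".jpeg", sep = ""), mcplt)
}

# invisible() suppresses the printed list of ggsave() return values
invisible(lapply(unique(mtcars$cyl), save_cyl))

Swapping lapply() for parallel::mclapply() later is then a one-word change.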

There are multiple ways to do all three of these steps, so let's try to evaluate them. I'll use the diamonds data from ggplot2 because it is bigger than the cars data; I hope differences in performance between methods will be noticeable that way. I learned a lot from this chapter of Hadley Wickham's book on measuring performance.

So that I can use profiling, I put the following block of code inside a function and save it in a separate R file named for_solution.r.

f <- function() {
  param <- unique(diamonds$cut)
  for (i in param) {
    mcplt <- diamonds %>%
      filter(cut == i) %>%
      ggplot(aes(x = carat, y = price)) +
      geom_point() +
      facet_wrap(~color) +
      ggtitle(paste("Cut: ", i, sep = ""))
    ggsave(mcplt, file = paste("Cut", i, ".jpeg", sep = ""))
  }
}

and then I do:

library(dplyr)
library(ggplot2)
source("for_solution.r", keep.source = TRUE)
Rprof(line.profiling = TRUE)  # enable line-level profiling
f()
Rprof(NULL)
summaryRprof(lines = "show")

Examining that output, I see that the block of code spends 97.25% of its time just saving the files. Examining the source of ggsave(), I can see that the function does a lot of defensive programming to identify the type of output, then opens the graphics device, prints, and closes the device. So I wonder whether doing just that step manually would help. I'm also going to take advantage of the fact that a jpeg device automatically produces a new file for each page, so the device only needs to be opened and closed once.

f1 <- function() {
  param <- unique(diamonds$cut)
  # open the jpeg device once; change defaults to match ggsave()
  jpeg("cut%03d.jpg", width = par("din")[1], height = par("din")[2],
       units = "in", res = 300)
  for (i in param) {
    mcplt <- diamonds %>%
      filter(cut == i) %>%
      ggplot(aes(x = carat, y = price)) +
      geom_point() +
      facet_wrap(~color) +
      ggtitle(paste("Cut: ", i, sep = ""))
    print(mcplt)  # each print() adds a page, i.e. a new numbered file
  }
  dev.off()
}

and now profiling again:

Rprof(line.profiling = TRUE)
f1()
Rprof(NULL)
summaryRprof(lines = "show")

f1() still spends most of its time on print(mcplt), but it is slightly faster than before (1.96 seconds compared to 2.18 seconds). One possible way to speed things up is to use a smaller device (lower resolution or a smaller image); when I used the defaults for jpeg(), the difference was larger, more like 25% faster. I also tried changing the device to png(), but that made no difference.
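
For reference, the two device calls I compared look like this (pick one); the second just takes jpeg()'s defaults of 480 x 480 pixels, which is where the roughly 25% saving came from:

# device matched to ggsave()'s behaviour: current plot size at 300 dpi
jpeg("cut%03d.jpg", width = par("din")[1], height = par("din")[2],
     units = "in", res = 300)

# device defaults: 480 x 480 pixels, so much less rendering and compression work
jpeg("cut%03d.jpg")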

Based on the profiling, I don't expect this to help, but for completeness I'm going to try doing away with the for loop and running everything inside dplyr with do(). I found this question and this one helpful.

jpeg("cut%03d.jpg",width=par("din")[1],height=par("din")[2],units="in",res=300) # open the jpeg device, change defaults to match ggsave()
plots = diamonds %>% group_by(cut) %>% 
  do({plot=ggplot(aes(x=carat, y=price),data=.) +
      geom_point() +
      facet_wrap(~color) +
      ggtitle(paste("Cut: ",.$cut,sep="")) 
    print(plot)})

dev.off()

Running that code gives

Error: Results are not data frames at positions: 1, 2, 3

but it seems to work. I believe the error arises because do() expects each group's result to be a data frame, and print() returns the plot object instead. Profiling seems to indicate it runs a bit faster, 1.78 seconds overall. But I don't like solutions that generate errors, even when they aren't causing problems.
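
If that diagnosis is right, a small change keeps the do() approach but avoids the error: have the expression return an empty data frame after printing. A sketch:

jpeg("cut%03d.jpg", width = par("din")[1], height = par("din")[2],
     units = "in", res = 300)
plots <- diamonds %>%
  group_by(cut) %>%
  do({
    plot <- ggplot(data = ., aes(x = carat, y = price)) +
      geom_point() +
      facet_wrap(~color) +
      ggtitle(paste("Cut: ", unique(.$cut), sep = ""))
    print(plot)
    data.frame()  # do() insists on a data frame; an empty one satisfies it
  })
dev.off()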

I have to stop here, but I've already learned a great deal about where to focus the attention. Other things to try would include (a sketch of the first idea follows the list):

  1. Using parallel or something similar to run each chunk of the data frame in a separate process. I'm not sure that would help if the bottleneck is saving the file, but if rendering the image is CPU-bound, I think it would.
  2. Trying data.table instead of dplyr, although again it's the printing part that's slow.
  3. Trying base graphics, lattice, or plotly instead of ggplot2. I've no idea about their relative speeds, but they could vary.
  4. Buying a faster hard drive! I just compared the speed of f() on my home computer, which has a regular hard drive, with my work machine, which has an SSD: the home machine is about 3x slower than the timings above.
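
To make the first suggestion concrete, here is a minimal sketch using parallel::mclapply(), which forks and so is not available on Windows; whether it helps depends on whether the time goes to rendering (CPU) or to writing files (disk):

library(parallel)
library(dplyr)
library(ggplot2)

save_cut <- function(i) {
  p <- diamonds %>%
    filter(cut == i) %>%
    ggplot(aes(x = carat, y = price)) +
    geom_point() +
    facet_wrap(~color) +
    ggtitle(paste("Cut: ", i, sep = ""))
  # each worker writes its own file, so the workers never share a device
  ggsave(paste("Cut", i, ".jpeg", sep = ""), p)
}

# as.character() so each worker gets a plain string rather than a factor level
invisible(mclapply(as.character(unique(diamonds$cut)), save_cut, mc.cores = 2))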