Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What affects the time to create a cluster using the parallel package?

I'm experiencing slowness when creating clusters using the parallel package.

Here is a function that just creates and then stops a PSOCK cluster, with n nodes.

library(parallel)
library(microbenchmark)
f <- function(n)
{
  cl <- makeCluster(n)
  on.exit(stopCluster(cl))
}
microbenchmark(f(2), f(4), times = 10)
## Unit: seconds
##  expr      min       lq   median       uq      max neval
##  f(2) 4.095315 4.103224 4.206586 5.080307 5.991463    10
##  f(4) 8.150088 8.179489 8.391088 8.822470 9.226745    10   

My machine (a reasonably modern 4-core workstation running Win 7 Pro) is taking about 4 seconds to create a two node cluster and 8 seconds to create a four node cluster. This struck me as too slow, so I tried the same profiling on a colleague's identically specced machine, and it took one/two seconds for the two tests respectively.

This suggested I may have some odd configuration set up on my machine, or that there is some other problem. I read the ?makeCluster and socketConnection help pages, but did not see anything related to improving performance.

I had a look in the Windows Task Manager while the code was running: there was no obvious interference with anti-virus or other software, just an Rscript process running at ~17% (less than one core).

I don't know where to look to find the source of the problem. Are there any known causes of slowness with PSOCK cluster creation under Windows?

Is 8 seconds to create a 4-node cluster actually slow (by 2014 standards), or are my expectations too high?

like image 849
Richie Cotton Avatar asked May 08 '14 11:05

Richie Cotton


People also ask

Does R run in parallel?

Basically, parallelization allows you to run multiple processes in your code simultaneously, rather than than iterating over a list one element at a time, or running a single process at a time. Thankfully, running R code in parallel is relatively simple using the parallel package.

What is parallel package in R?

“parallel” Package The parallel package in R can perform tasks in parallel by providing the ability to allocate cores to R. The working involves finding the number of cores in the system and allocating all of them or a subset to make a cluster.

Is Lapply parallel?

lapply-based parallelism may be the most intuitively familiar way to parallelize tasks in R because it extend R's prolific lapply function.


1 Answers

To monitor what was happening, I installed and opened Process Monitor (HT @qethanm). I also exited most of the things in my system tray like Dropbox, in order to generate less noise. (Though in the end, this didn't make a difference.)

I then re-ran a simplified version of the R code in the question, directly from R GUI (instead of an IDE).

microbenchmark(f(4), times = 5)

After some digging, I noticed that R GUI spawns an Rscript process for each cluster that it creates (see picture).

the process tree shows an Rscript instance for each node in each cluster

After many dead ends and wild goose chases, it occurred to me that perhaps these Rscript instances weren't vanilla R. I renamed my Rprofile.site file to hide it and repeated the benchmark.

This time, a 4 node cluster was created, on average, in just under a second.

For a four node cluster, the Rprofile.site file (and presumably the personal startup file, ~/.Rprofile, if it exists) gets read four times, which can slow things down considerably. Pass rscript_args = c("--no-init-file", "--no-site-file", "--no-environ") to makeCluster to avoid this behaviour.

like image 138
Richie Cotton Avatar answered Oct 21 '22 19:10

Richie Cotton