I'm experiencing slowness when creating clusters using the parallel
package.
Here is a function that just creates and then stops a PSOCK cluster, with n
nodes.
library(parallel)
library(microbenchmark)
f <- function(n)
{
    cl <- makeCluster(n)
    on.exit(stopCluster(cl))
}
microbenchmark(f(2), f(4), times = 10)
## Unit: seconds
##  expr      min       lq   median       uq      max neval
##  f(2) 4.095315 4.103224 4.206586 5.080307 5.991463    10
##  f(4) 8.150088 8.179489 8.391088 8.822470 9.226745    10
My machine (a reasonably modern 4-core workstation running Win 7 Pro) takes about 4 seconds to create a two-node cluster and 8 seconds to create a four-node cluster. This struck me as too slow, so I ran the same benchmark on a colleague's identically specced machine, where the two tests took about one and two seconds respectively.
This suggested I may have some odd configuration set up on my machine, or that there is some other problem. I read the ?makeCluster and ?socketConnection help pages, but did not see anything related to improving performance.
I had a look in the Windows Task Manager while the code was running: there was no obvious interference with anti-virus or other software, just an Rscript process running at ~17% (less than one core).
I don't know where to look to find the source of the problem. Are there any known causes of slowness with PSOCK cluster creation under Windows?
Is 8 seconds to create a 4-node cluster actually slow (by 2014 standards), or are my expectations too high?
To monitor what was happening, I installed and opened Process Monitor (HT @qethanm). I also exited most of the things in my system tray like Dropbox, in order to generate less noise. (Though in the end, this didn't make a difference.)
I then re-ran a simplified version of the R code in the question, directly from R GUI (instead of an IDE).
microbenchmark(f(4), times = 5)
After some digging, I noticed that R GUI spawns an Rscript process for each cluster that it creates (see picture).
After many dead ends and wild goose chases, it occurred to me that perhaps these Rscript instances weren't vanilla R. I renamed my Rprofile.site file to hide it and repeated the benchmark.
This time, a 4 node cluster was created, on average, in just under a second.
For a four-node cluster, the Rprofile.site file (and presumably the personal startup file, ~/.Rprofile, if it exists) gets read four times, which can slow things down considerably. Pass rscript_args = c("--no-init-file", "--no-site-file", "--no-environ") to makeCluster to avoid this behaviour.
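As a sketch, here is the original benchmark side by side with a variant that suppresses the startup files in the worker processes (exact timings will of course vary by machine and by what your Rprofile.site does):

```r
library(parallel)
library(microbenchmark)

# Cluster creation as before: each worker's Rscript reads the startup files
f_default <- function(n) {
    cl <- makeCluster(n)
    on.exit(stopCluster(cl))
}

# Cluster creation with startup files suppressed in the workers
f_vanilla <- function(n) {
    cl <- makeCluster(
        n,
        rscript_args = c("--no-init-file", "--no-site-file", "--no-environ")
    )
    on.exit(stopCluster(cl))
}

microbenchmark(f_default(4), f_vanilla(4), times = 5)
```

On a machine with a slow site-wide startup file, f_vanilla should be noticeably faster; if the two are comparable, the startup files were not the bottleneck.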