Parallel processing in R - setting seed with mclapply() vs. pbmclapply()

Tags:

I'm parallelizing simulations in R (using mclapply() from the parallel package) and wanted to track my progress with each function call. So I instead decided to use pbmclapply() from the pbmcapply package in order to have a progress bar each time I run my simulations (pbmclapply() is specifically created as a wrapper for mclapply(), so they should have the same functionality except for the progress bar).

I was able to set a seed and get reproducible results without a problem using mclapply(), but pbmclapply() is giving me different results with each run, which I'm perplexed by. I've included a pretty simple reprex below.

For example, this is using mcapply():

## GIVES THE SAME RESULT EACH TIME IT IS RUN
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
x <- mclapply(1:100, function(i) {rnorm(1)}, mc.cores = 2)
y <- do.call(rbind, x)
z <- mean(y)
print(mean(z))

And this is the same code using pbmclapply():

## GIVES DIFFERENT RESULTS EACH TIME IT IS RUN
library(pbmcapply)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
x <- pbmclapply(1:100, function(i) {rnorm(1)}, mc.cores = 2)
y <- do.call(rbind, x)
z <- mean(y)
print(mean(z))

The only difference between the two blocks of code above is the use of pbmclapply() in the second and mclapply() in the first, yet the first block gives me a consistent result every time I run it, and the second block gives different results each time it is run (though a seed is set in the same way).

What is the difference in the seeding procedure between these two functions? I would appreciate any feedback as to why this is happening. Thanks!

318

asked May 23 '21 02:05

bob

1 Answers

The issue is that in the utils.R file within the pbmcapply package it runs the following line:

if (isTRUE(mc.set.seed))
      mc.set.stream()

If we compare this to what is being called when we run the mclapply() function in the parallel package we see that it runs:

if (mc.set.seed) 
        mc.reset.stream()

This affects the results as reset stream will allow the code to be run from the globally set seed, whereas running set stream sets it to the a new random starting value using the initial seed. We can see this in the functions attached below:

mc.reset.stream <- function () 
{
    if (RNGkind()[1L] == "L'Ecuyer-CMRG") {
        if (!exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE)) 
            sample.int(1L)
        # HERE! sets the seed to the global seed variable we set
        assign("LEcuyer.seed", get(".Random.seed", envir = .GlobalEnv, 
            inherits = FALSE), envir = RNGenv)
    }
}

mc.set.stream <- function () 
{
    if (RNGkind()[1L] == "L'Ecuyer-CMRG") {
        assign(".Random.seed", get("LEcuyer.seed", envir = RNGenv), 
            envir = .GlobalEnv)
    }
    else {
        if (exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE)) 
            rm(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
    }
}

I believe this change may be due to an issue with mclapply when you want to call the mclapply function more than once after setting the seed it will use the same random numbers. (i.e. by resetting the r session you should get the same results in the same order with pbmclapply so first time I get 0.143 then 0.064 and then -0.015). This is usually the preferred behaviour so you can call the funciton multiple times. See R doesn't reset the seed when "L'Ecuyer-CMRG" RNG is used? for more information.

The differences between these two implementations can be tested with the following code if you change the line in the .customized_mcparallel funciton definition from mc.set.stream() to mc.reset.stream(). Here I have simplified the function calls in the package to strip out the progress bar and leave in only the calculation (removing error checks also) and the change in setting the random seed. (Additionally note these functions will no longer run on a Windows machine only Linux or MacOS).

library(pbmcapply)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
pbmclapply <- function()  {

  pkg <- asNamespace('pbmcapply')
  .cleanup <- get('.cleanup', pkg)


  progressMonitor <- .customized_mcparallel({

    mclapply(1:100, function(i) {
            rnorm(1)
        }, mc.cores = 2, mc.preschedule = TRUE, mc.set.seed = TRUE,
                       mc.cleanup = TRUE, mc.allow.recursive = TRUE)
  })

  # clean up processes on exit
  on.exit(.cleanup(progressMonitor$pid), add = T)

  # Retrieve the result
  results <- suppressWarnings(mccollect(progressMonitor$pid)[[as.character(progressMonitor$pid)]])

  return(results)
}

.customized_mcparallel <- function (expr, name, detached = FALSE){
  # loading hidden functions
  pkg <- asNamespace('parallel')
  mcfork <- get('mcfork', pkg)
  mc.advance.stream <- get('mc.advance.stream', pkg)
  mcexit <- get('mcexit', pkg)
  mcinteractive <- get('mcinteractive', pkg)
  sendMaster <- get('sendMaster', pkg)
  mc.set.stream <- get('mc.set.stream', pkg)
  mc.reset.stream <- get('mc.reset.stream', pkg)

  f <- mcfork(F)
  env <- parent.frame()
  mc.advance.stream()
  if (inherits(f, "masterProcess")) {

    mc.set.stream()
    # reset the group process id of the forked process
    mcinteractive(FALSE)

    sendMaster(try(eval(expr, env), silent = TRUE))
    mcexit(0L)
  }

  f
}

x <- pbmclapply()
y <- do.call(rbind, x)
z <- mean(y)
print(z)

For a complete remedy my best suggestion would be to either reimplement the functions in your own code (I copy pasted with some minor modifications to the functions from pbmcapply) or by forking the package and replacing the mc.set.seed in the utils.R file with mc.reset.seed. I can't think of a simpler solution at the moment, but hopefully this clarifies the issue.

198

answered Oct 16 '22 13:10

Joel Kandiah

Related questions
                            
                                From list to data frame with tidyverse, selecting specific list elements
                            
                                Assign group id to sequence of concecutive unique values in timeseries
                            
                                Query to Snowflake database isn't working because no active warehouse is selected
                            
                                Specify the dots argument when calling a tidyselect-using function without needing to specify the preceding arguments
                            
                                Conditionally determining value of column by looking at last group
                            
                                Button to view in full screen
                            
                                "Debug location is approximate because the source is not available" in R 4.0.0 + RStudio
                            
                                An analog to rnorm in python
                            
                                r summarize_if with multiple conditions
                            
                                R devtools unable to install - Ubuntu 20.04 - package or namespace load failed for ‘pkgload’
                            
                                Pause and resume caret training in R
                            
                                Numbered captions on customized and reactive figure in R markdown HTML file
                            
                                Standardize variables using dplyr [r]
                            
                                Fable: Extracting the p,d,q specification from an ARIMA model
                            
                                How to shade shapes
                            
                                Column to nested list separated by /
                            
                                How to select a specific tab in R Markdown?
                            
                                How to plot 'outside' of plotting area using ggplot in R?
                            
                                Passing files from a rocker container to a latex container within a gitlab-ci job
                            
                                How to use ifelse inside map function in R

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Parallel processing in R - setting seed with mclapply() vs. pbmclapply()

Tags:

r

parallel-processing

random-seed

reproducible-research

mclapply

bob

People also ask

1 Answers

Joel Kandiah

Recent Activity

Donate For Us