Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parallel processing in R - setting seed with mclapply() vs. pbmclapply()

I'm parallelizing simulations in R (using mclapply() from the parallel package) and wanted to track my progress with each function call. So I instead decided to use pbmclapply() from the pbmcapply package in order to have a progress bar each time I run my simulations (pbmclapply() is specifically created as a wrapper for mclapply(), so they should have the same functionality except for the progress bar).

I was able to set a seed and get reproducible results without a problem using mclapply(), but pbmclapply() is giving me different results with each run, which I'm perplexed by. I've included a pretty simple reprex below.

For example, this is using mcapply():

## GIVES THE SAME RESULT EACH TIME IT IS RUN
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
x <- mclapply(1:100, function(i) {rnorm(1)}, mc.cores = 2)
y <- do.call(rbind, x)
z <- mean(y)
print(mean(z))

And this is the same code using pbmclapply():

## GIVES DIFFERENT RESULTS EACH TIME IT IS RUN
library(pbmcapply)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
x <- pbmclapply(1:100, function(i) {rnorm(1)}, mc.cores = 2)
y <- do.call(rbind, x)
z <- mean(y)
print(mean(z))

The only difference between the two blocks of code above is the use of pbmclapply() in the second and mclapply() in the first, yet the first block gives me a consistent result every time I run it, and the second block gives different results each time it is run (though a seed is set in the same way).

What is the difference in the seeding procedure between these two functions? I would appreciate any feedback as to why this is happening. Thanks!

like image 318
bob Avatar asked May 23 '21 02:05

bob


People also ask

What is Mclapply in R?

mclapply is a parallelized version of lapply , it returns a list of the same length as X , each element of which is the result of applying FUN to the corresponding element of X . It relies on forking and hence is not available on Windows unless mc.

What is parallel processing in R?

Parallel processing (in the extreme) means that all the f# processes start simultaneously and run to completion on their own. If we have a single computer at our disposal and have to run n models, each taking s seconds, the total running time will be n*s .

Does R support parallel computing?

There are various packages in R which allow parallelization. “parallel” Package The parallel package in R can perform tasks in parallel by providing the ability to allocate cores to R. The working involves finding the number of cores in the system and allocating all of them or a subset to make a cluster.


1 Answers

The issue is that in the utils.R file within the pbmcapply package it runs the following line:

if (isTRUE(mc.set.seed))
      mc.set.stream()

If we compare this to what is being called when we run the mclapply() function in the parallel package we see that it runs:

if (mc.set.seed) 
        mc.reset.stream()

This affects the results as reset stream will allow the code to be run from the globally set seed, whereas running set stream sets it to the a new random starting value using the initial seed. We can see this in the functions attached below:

mc.reset.stream <- function () 
{
    if (RNGkind()[1L] == "L'Ecuyer-CMRG") {
        if (!exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE)) 
            sample.int(1L)
        # HERE! sets the seed to the global seed variable we set
        assign("LEcuyer.seed", get(".Random.seed", envir = .GlobalEnv, 
            inherits = FALSE), envir = RNGenv)
    }
}

mc.set.stream <- function () 
{
    if (RNGkind()[1L] == "L'Ecuyer-CMRG") {
        assign(".Random.seed", get("LEcuyer.seed", envir = RNGenv), 
            envir = .GlobalEnv)
    }
    else {
        if (exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE)) 
            rm(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
    }
}

I believe this change may be due to an issue with mclapply when you want to call the mclapply function more than once after setting the seed it will use the same random numbers. (i.e. by resetting the r session you should get the same results in the same order with pbmclapply so first time I get 0.143 then 0.064 and then -0.015). This is usually the preferred behaviour so you can call the funciton multiple times. See R doesn't reset the seed when "L'Ecuyer-CMRG" RNG is used? for more information.

The differences between these two implementations can be tested with the following code if you change the line in the .customized_mcparallel funciton definition from mc.set.stream() to mc.reset.stream(). Here I have simplified the function calls in the package to strip out the progress bar and leave in only the calculation (removing error checks also) and the change in setting the random seed. (Additionally note these functions will no longer run on a Windows machine only Linux or MacOS).

library(pbmcapply)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
pbmclapply <- function()  {

  pkg <- asNamespace('pbmcapply')
  .cleanup <- get('.cleanup', pkg)


  progressMonitor <- .customized_mcparallel({

    mclapply(1:100, function(i) {
            rnorm(1)
        }, mc.cores = 2, mc.preschedule = TRUE, mc.set.seed = TRUE,
                       mc.cleanup = TRUE, mc.allow.recursive = TRUE)
  })

  # clean up processes on exit
  on.exit(.cleanup(progressMonitor$pid), add = T)

  # Retrieve the result
  results <- suppressWarnings(mccollect(progressMonitor$pid)[[as.character(progressMonitor$pid)]])

  return(results)
}

.customized_mcparallel <- function (expr, name, detached = FALSE){
  # loading hidden functions
  pkg <- asNamespace('parallel')
  mcfork <- get('mcfork', pkg)
  mc.advance.stream <- get('mc.advance.stream', pkg)
  mcexit <- get('mcexit', pkg)
  mcinteractive <- get('mcinteractive', pkg)
  sendMaster <- get('sendMaster', pkg)
  mc.set.stream <- get('mc.set.stream', pkg)
  mc.reset.stream <- get('mc.reset.stream', pkg)

  f <- mcfork(F)
  env <- parent.frame()
  mc.advance.stream()
  if (inherits(f, "masterProcess")) {

    mc.set.stream()
    # reset the group process id of the forked process
    mcinteractive(FALSE)

    sendMaster(try(eval(expr, env), silent = TRUE))
    mcexit(0L)
  }

  f
}

x <- pbmclapply()
y <- do.call(rbind, x)
z <- mean(y)
print(z)

For a complete remedy my best suggestion would be to either reimplement the functions in your own code (I copy pasted with some minor modifications to the functions from pbmcapply) or by forking the package and replacing the mc.set.seed in the utils.R file with mc.reset.seed. I can't think of a simpler solution at the moment, but hopefully this clarifies the issue.

like image 198
Joel Kandiah Avatar answered Oct 16 '22 13:10

Joel Kandiah