
What is the best practice for making functions in my R package parallelizable?

I have developed an R package that contains embarrassingly parallel functions.

I would like to implement parallelization for these functions in a way that is transparent to the user, ideally regardless of their OS.

I have looked around to see how other package authors have implemented foreach-based parallelism. For example, Max Kuhn's caret package imports foreach to use %dopar% but relies on the user to specify a parallel backend. (Several examples use doMC, which doesn't work on Windows.)

Noting that doParallel works for Windows and Linux/OSX and uses the built-in parallel package (see comments here for useful discussion), does it make sense to import doParallel and have my functions call registerDoParallel() whenever the user specifies parallel=TRUE as an argument?

C8H10N4O2 asked Feb 21 '17 17:02


1 Answer

I think it's very important to allow the user to register their own parallel backend. The doParallel backend is very portable, but what if they want to run your function on multiple nodes of a cluster? What if they want to set the makeCluster "outfile" option? It's unfortunate if making the parallel support transparent also makes it useless for many of your users.
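To make the point concrete, here is a rough sketch of what such a user might run in their own script before calling the package function. The helper name, the host names and the log file name are all placeholders invented for illustration; only makePSOCKcluster, its outfile option and registerDoParallel come from the parallel and doParallel packages.

```r
# Hypothetical helper a user might define in their own script.
# "node1", "node2" and "workers.log" are placeholder values.
register_custom_backend <- function(hosts = c("node1", "node2"),
                                    logfile = "workers.log") {
  # One worker is started per host name; outfile redirects the workers'
  # stdout and stderr to the given file.
  cl <- parallel::makePSOCKcluster(hosts, outfile = logfile)
  doParallel::registerDoParallel(cl)
  # Return the handle so the user can call parallel::stopCluster(cl) later.
  invisible(cl)
}
```

A package function that always registers its own backend would silently override a setup like this one.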

I suggest that you use the getDoParRegistered function to see if the user has already registered a parallel backend, and only register one for them if they haven't.

Here's an example:

library(doParallel)
parfun <- function(n=10, parallel=FALSE,
                   cores=getOption('mc.cores', 2L)) {
  if (parallel) {
    # honor registration made by user, and only create and register
    # our own cluster object once
    if (! getDoParRegistered()) {
      cl <- makePSOCKcluster(cores)
      registerDoParallel(cl)
      message('Registered doParallel with ',
              cores, ' workers')
    } else {
      message('Using ', getDoParName(), ' with ',
              getDoParWorkers(), ' workers')
    }
    `%d%` <- `%dopar%`
  } else {
    message('Executing parfun sequentially')
    `%d%` <- `%do%`
  }

  foreach(i=seq_len(n), .combine='c') %d% {
    Sys.sleep(1)
    i
  }
}

This is written so that it only runs in parallel if parallel=TRUE, even if the user has already registered a parallel backend:

> parfun()
Executing parfun sequentially
 [1]  1  2  3  4  5  6  7  8  9 10

If parallel=TRUE and they haven't registered a backend, then it will create and register a cluster object for them:

> parfun(parallel=TRUE, cores=3)
Registered doParallel with 3 workers
 [1]  1  2  3  4  5  6  7  8  9 10

If parfun is called with parallel=TRUE again, it will use the previously registered cluster:

> parfun(parallel=TRUE)
Using doParallelSNOW with 3 workers
 [1]  1  2  3  4  5  6  7  8  9 10

This can be refined in many ways: it's just a simple demonstration. But at least it provides a convenience without preventing users from registering a different backend with custom options that might be necessary in their environment.


Note that the choice of a default number of cores/workers is also a tricky issue, and one that the CRAN maintainers care about. That is why I didn't make the default number of cores detectCores(). Instead, I'm using the same default as mclapply, getOption('mc.cores', 2L), although perhaps a different option name should be used.
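One way to use a different option name is sketched below. It assumes a package-specific option, here called "mypkg.cores" (a made-up name), that takes precedence over mc.cores, with a conservative fallback of 2. The helper name default_workers is also hypothetical.

```r
# Hypothetical helper: look up the worker count from a package-specific
# option first, then from mc.cores, then fall back to a small fixed value.
default_workers <- function(option = "mypkg.cores", fallback = 2L) {
  n <- getOption(option, getOption("mc.cores", fallback))
  n <- suppressWarnings(as.integer(n))
  if (length(n) != 1L || is.na(n) || n < 1L) {
    stop("worker count must be a single positive integer", call. = FALSE)
  }
  n
}
```

With this in place, the signature could become cores=default_workers(), and a user could opt in to more workers with options(mypkg.cores = 8) without the package ever calling detectCores() on their behalf.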


Concerning stopCluster

Note that this example will sometimes create a new cluster object, but it never stops it via a call to stopCluster. The reason is that creating cluster objects can be expensive, so I like to reuse them for multiple foreach loops rather than create and destroy them each time. I'd rather leave stopping the cluster to the user. In this example, however, the user has no way to do that, since they don't have access to the cl variable.

There are three ways to handle this:

  • Call stopCluster in parfun whenever makePSOCKcluster is called;
  • Write an additional function that allows the user to stop the implicitly created cluster object (equivalent to the stopImplicitCluster function in the doParallel package);
  • Don't worry about the implicitly created cluster object.

I would probably choose the second option for my own code, but that would significantly complicate this example. It's already rather complicated.
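For readers who want to try the second option, a minimal sketch might look like the following. It is not the doParallel implementation. The names .pkg_state, no_parallel_backend, ensure_backend and stop_implicit_cluster are invented here, and the check against the backend name "doSEQ" assumes that this is what foreach reports for its sequential backend.

```r
# Private environment holding the cluster this package created, if any.
.pkg_state <- new.env(parent = emptyenv())

# TRUE when no parallel backend is available: nothing is registered, or
# only foreach's sequential backend (assumed to report the name "doSEQ").
no_parallel_backend <- function() {
  !foreach::getDoParRegistered() ||
    identical(foreach::getDoParName(), "doSEQ")
}

# Create and register a cluster only if the user has not registered one.
ensure_backend <- function(cores = getOption("mc.cores", 2L)) {
  if (no_parallel_backend()) {
    .pkg_state$cl <- parallel::makePSOCKcluster(cores)
    doParallel::registerDoParallel(.pkg_state$cl)
  }
  invisible(NULL)
}

# Stop the cluster created by ensure_backend(), if there is one.
# Returns TRUE if a cluster was stopped and FALSE otherwise.
stop_implicit_cluster <- function() {
  cl <- .pkg_state$cl
  if (is.null(cl)) return(invisible(FALSE))
  parallel::stopCluster(cl)
  .pkg_state$cl <- NULL
  # Fall back to the sequential backend so that later %dopar% loops do
  # not try to use the stopped cluster.
  foreach::registerDoSEQ()
  invisible(TRUE)
}
```

In parfun, the makePSOCKcluster and registerDoParallel calls would be replaced by ensure_backend(cores), and stop_implicit_cluster would be exported for the user. One limitation of this sketch: if the user registers their own backend after the implicit cluster was created, stop_implicit_cluster() still resets the registration to sequential, so a fuller version would need to track which backend is currently active.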

Steve Weston answered Nov 06 '22 21:11