I have developed an R package that contains embarrassingly parallel functions.
I would like to implement parallelization for these functions in a way that is transparent to the user, ideally regardless of their OS.
I have looked around to see how other package authors have imported foreach-based parallelism. For example, Max Kuhn's caret package imports foreach to use %dopar%, but relies on the user to specify a parallel backend. (Several examples use doMC, which doesn't work on Windows.)
Noting that doParallel works on both Windows and Linux/OS X and uses the built-in parallel package (see the comments here for useful discussion), does it make sense to import doParallel and have my functions call registerDoParallel() whenever the user specifies parallel=TRUE as an argument?
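For concreteness, here is a minimal sketch of the pattern I have in mind (myfun and its arguments are just placeholders):

library(doParallel)

myfun <- function(x, parallel=FALSE, cores=2) {
  # naive approach: register a doParallel backend on every parallel call
  if (parallel)
    registerDoParallel(cores)
  # with no registered backend, %dopar% falls back to sequential
  # execution with a warning
  foreach(i=seq_along(x), .combine='c') %dopar% sqrt(x[i])
}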
I think it's very important to allow the user to register their own parallel backend. The doParallel backend is very portable, but what if they want to run your function on multiple nodes of a cluster? What if they want to set the makeCluster "outfile" option? It's unfortunate if making the parallel support transparent also makes it useless for many of your users.
I suggest that you use the getDoParRegistered function to see whether the user has already registered a parallel backend, and only register one for them if they haven't.
Here's an example:
library(doParallel)

parfun <- function(n=10, parallel=FALSE,
                   cores=getOption('mc.cores', 2L)) {
  if (parallel) {
    # honor registration made by user, and only create and register
    # our own cluster object once
    if (! getDoParRegistered()) {
      cl <- makePSOCKcluster(cores)
      registerDoParallel(cl)
      message('Registered doParallel with ',
              cores, ' workers')
    } else {
      message('Using ', getDoParName(), ' with ',
              getDoParWorkers(), ' workers')
    }
    `%d%` <- `%dopar%`
  } else {
    message('Executing parfun sequentially')
    `%d%` <- `%do%`
  }
  foreach(i=seq_len(n), .combine='c') %d% {
    Sys.sleep(1)
    i
  }
}
This is written so that it only runs in parallel if parallel=TRUE, even if the user has registered a parallel backend:
> parfun()
Executing parfun sequentially
[1] 1 2 3 4 5 6 7 8 9 10
If parallel=TRUE and they haven't registered a backend, then it will create and register a cluster object for them:
> parfun(parallel=TRUE, cores=3)
Registered doParallel with 3 workers
[1] 1 2 3 4 5 6 7 8 9 10
If parfun is called with parallel=TRUE again, it will use the previously registered cluster:
> parfun(parallel=TRUE)
Using doParallelSNOW with 3 workers
[1] 1 2 3 4 5 6 7 8 9 10
This can be refined in many ways: it's just a simple demonstration. But at least it provides a convenience without preventing users from registering a different backend with custom options that might be necessary in their environment.
Note that the choice of a default number of cores/workers is also a tricky issue, and one that the CRAN maintainers care about. That is why I didn't make the default number of cores detectCores(). Instead, I'm using the same default as mclapply (the mc.cores option), although perhaps a different option name should be used.
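To illustrate, a user can raise that default for every parfun call by setting the option once (the value 4 here is arbitrary):

> options(mc.cores = 4)
> getOption('mc.cores', 2L)
[1] 4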
Concerning stopCluster
Note that this example will sometimes create a new cluster object, but it never stops it via a call to stopCluster. The reason is that creating cluster objects can be expensive, so I like to reuse them for multiple foreach loops rather than create and destroy them each time. I'd rather leave that to the user; in this example, however, there isn't a way for the user to do that, since they don't have access to the cl variable.
There are three ways to handle this:

- Call stopCluster in parfun whenever makePSOCKcluster is called;
- Register the workers implicitly so they can be shut down later (using the stopImplicitCluster function in the doParallel package);
- Give the user access to the cluster object so they can stop it themselves.

I would probably choose the second option for my own code, but that would significantly complicate this example. It's already rather complicated.
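Still, here is a minimal sketch of that second option (parfun2 is an illustrative name), which gives up reusing the cluster across calls:

library(doParallel)

parfun2 <- function(n=10, parallel=FALSE,
                    cores=getOption('mc.cores', 2L)) {
  `%d%` <- `%do%`
  if (parallel) {
    if (! getDoParRegistered()) {
      # implicit registration: doParallel manages the workers itself,
      # so there is no cl variable to keep track of
      registerDoParallel(cores)
      # shut the implicit cluster down when parfun2 returns
      on.exit(stopImplicitCluster())
    }
    `%d%` <- `%dopar%`
  }
  foreach(i=seq_len(n), .combine='c') %d% {
    Sys.sleep(1)
    i
  }
}

One wrinkle: stopImplicitCluster doesn't unregister the backend with foreach, so on platforms where registerDoParallel(cores) creates an implicit cluster, a second parallel call would find a registered but already stopped backend. Handling that cleanly is part of what would complicate the example.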