I've recently started using parallel techniques in R for a project and have my program working on Linux systems using mclapply from the parallel package. However, I've hit a roadblock with my understanding of parLapply on Windows.
Using mclapply, I can set the number of cores and the number of iterations, and pass that to an existing function in my workspace:
mclapply(1:8, function(z) adder(z, 100), mc.cores=4)
I don't seem to be able to achieve the same on Windows using parLapply. As I understand it, I need to pass all the variables through using clusterExport() and pass the actual function I want to apply as an argument. Is this correct, or is there something similar to mclapply that works on Windows?
The beauty of mclapply is that the worker processes are all created as clones of the master right at the point that mclapply is called, so you don't have to worry about reproducing your environment on each of the cluster workers. Unfortunately, that isn't possible on Windows.
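For instance, a variable sitting in the master's workspace is automatically visible to the forked workers with no export step. A minimal sketch, assuming a Unix-alike system, since this is exactly what is not possible on Windows:

library(parallel)

base <- 100
# The forked workers are clones of the master process, so they
# already see 'base' without any explicit export
mclapply(1:4, function(z) z + base, mc.cores = 2)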
When using parLapply, you generally have to perform the following additional steps:

- Create a PSOCK cluster
- Register the cluster if desired
- Load necessary packages on the cluster workers
- Export necessary data and functions to the global environment of the cluster workers
Also, when you're done, it's good practice to shut down the PSOCK cluster using stopCluster.
Here's a translation of your example to parLapply:
library(parallel)

cl <- makePSOCKcluster(4)
setDefaultCluster(cl)
adder <- function(a, b) a + b
clusterExport(NULL, c('adder'))
parLapply(NULL, 1:8, function(z) adder(z, 100))
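When the parallel work is finished, the cluster can be shut down as recommended above. A minimal continuation of the example:

# Results come back as a list, just as with lapply/mclapply
results <- parLapply(NULL, 1:8, function(z) adder(z, 100))
unlist(results)  # 101 102 103 104 105 106 107 108

# Shut down the worker processes once you're done
stopCluster(cl)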
If your adder function requires a package, you'll have to load that package on each of the workers before calling it with parLapply. You can do that quite easily with clusterEvalQ:
clusterEvalQ(NULL, library(MASS))
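As a minimal sketch of why this matters, suppose the function you hand to parLapply calls MASS::ginv (a hypothetical choice, used only for illustration); once MASS is loaded on the workers, the call succeeds:

clusterEvalQ(NULL, library(MASS))

# Each worker can now find ginv() because MASS is loaded in its session
mats <- replicate(4, matrix(rnorm(9), 3, 3), simplify = FALSE)
parLapply(NULL, mats, function(m) ginv(m))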
Note that the NULL first argument to clusterExport, clusterEvalQ and parLapply indicates that they should use the cluster object registered via setDefaultCluster. That can be very useful if your program uses mclapply in many different functions, since you don't have to pass the cluster object to every function that needs it when converting your program to use parLapply.
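For example, a helper function elsewhere in the program can simply pass NULL and pick up the registered cluster (add_hundred below is a hypothetical helper, used only for illustration):

# Hypothetical helper: no cluster argument needed, because parLapply(NULL, ...)
# falls back to the cluster registered with setDefaultCluster(cl)
add_hundred <- function(xs) {
  parLapply(NULL, xs, function(z) adder(z, 100))
}

add_hundred(1:8)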
Of course, adder may call other functions in your global environment, which call other functions, etc. In that case, you'll have to export them as well and load any packages that they need. Also note that if any variables that you've exported change during the course of your program, you will have to export them again in order to update them on the cluster workers. Again, that isn't necessary with mclapply, because it creates/clones/forks fresh workers every time it is called.
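A minimal sketch of re-exporting a changed variable (offset is just an illustrative name):

offset <- 100
clusterExport(NULL, 'offset')
parLapply(NULL, 1:4, function(z) adder(z, offset))  # workers see offset = 100

offset <- 500                  # the value changes on the master only
clusterExport(NULL, 'offset')  # push the new value to the workers
parLapply(NULL, 1:4, function(z) adder(z, offset))  # workers now see offset = 500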
mclapply is simpler to use, and relies on the underlying operating system's fork() functionality to achieve parallelization. However, since Windows does not have fork(), on Windows it can only run serially, i.e. as a standard lapply with no parallelization.
parLapply is a different beast. It will create a cluster of processes, which could even reside on different machines on your network, and they communicate via TCP/IP in order to pass the tasks and results between each other.
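For example, a PSOCK cluster can include workers on other hosts; nodeA below is a placeholder hostname, and this sketch assumes R is installed on that machine and it is reachable (e.g. via passwordless SSH):

library(parallel)

# Two workers on this machine and two on a remote host (hypothetical name)
cl <- makePSOCKcluster(c("localhost", "localhost", "nodeA", "nodeA"))
parLapply(cl, 1:8, sqrt)
stopCluster(cl)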
The problem with your code is that you didn't realize the first parameter to parLapply should be a "cluster" object. The simplest example I can think of for using parLapply on a single machine is this:
library(parallel)

# Spawn child processes using fork() on the local machine
cl <- makeForkCluster(getOption("cl.cores", 2))

# Use parLapply to calculate the lengths of 1000 strings
text <- rep("Hello, world!", 1000)
len <- parLapply(cl, text, nchar)

# Kill the child processes since they are no longer needed
stopCluster(cl)
Using parLapply with a cluster created via makeForkCluster as above is functionally equivalent to calling mclapply, so it will also not work on Windows. :) Take a look at the other ways to create a cluster with makeCluster and makePSOCKcluster in the documentation, and check out what works best for your requirements.
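For instance, a Windows-friendly variant of the example above is a minimal sketch that just swaps makeForkCluster for makePSOCKcluster:

library(parallel)

# Start workers as separate R processes communicating over sockets,
# which works on Windows as well as on Linux
cl <- makePSOCKcluster(getOption("cl.cores", 2))

text <- rep("Hello, world!", 1000)
len <- parLapply(cl, text, nchar)

stopCluster(cl)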