
Nesting parallel functions in R

I'm familiar with foreach, %dopar% and the like. I am also familiar with the parallel option for cv.glmnet. But how do you set up nested parallelisation as below?

library(glmnet)
library(foreach)
library(parallel)
library(doSNOW)
Npar <- 1000
Nobs <- 200
Xdat <- matrix(rnorm(Nobs * Npar), ncol = Npar)
Xclass <- rep(1:2, each = Nobs/2)
Ydat <- rnorm(Nobs)

Parallel cross-validation:

cl <- makeCluster(8, type = "SOCK")
registerDoSNOW(cl)
system.time(mods <- foreach(x = 1:2, .packages = "glmnet") %dopar% {
    idx <- Xclass == x
    cv.glmnet(Xdat[idx,], Ydat[idx], nfolds = 4, parallel = TRUE)
})
stopCluster(cl)

Not parallel cross-validation:

cl <- makeCluster(8, type = "SOCK")
registerDoSNOW(cl)
system.time(mods <- foreach(x = 1:2, .packages = "glmnet") %dopar% {
    idx <- Xclass == x
    cv.glmnet(Xdat[idx,], Ydat[idx], nfolds = 4, parallel = FALSE)
})
stopCluster(cl)

For the two system times I am only getting a very marginal difference.

Is parallelisation taken care of? Or do I need to use the nested operator explicitly?

Side question: if 8 cores are available in a cluster object and the foreach loop contains two tasks, will each task be given one core (with the other 6 cores left idle), or will each task be given four cores (using all 8 cores in total)? How can I query how many cores are being used at a given time?

dynamo asked Oct 20 '22 16:10


1 Answer

In your parallel cross-validation example, cv.glmnet itself will not run in parallel because there is no foreach parallel backend registered on the cluster workers. The outer foreach loop runs in parallel, but the foreach loop inside cv.glmnet falls back to sequential execution, which is why the two timings are nearly identical.
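You can check this yourself. The following is a minimal sketch, assuming doSNOW and foreach are installed on the machine that runs the workers; the variable names are only for illustration. It asks the master and each worker whether a foreach backend is registered. Note that getDoParWorkers() reports how many workers the registered backend has, not how many cores are busy at a given moment. With two tasks in the outer loop, only two of the registered workers receive work.

```r
library(doSNOW)  # also attaches foreach and snow

cl <- makeCluster(2, type = "SOCK")
registerDoSNOW(cl)

# On the master, a backend is registered with one worker per cluster node.
master_registered <- getDoParRegistered()
master_workers <- getDoParWorkers()

# On each worker, load foreach and ask the same question.
# A freshly started worker has no backend of its own.
worker_registered <- clusterCall(cl, function() {
  library(foreach)
  getDoParRegistered()
})

print(master_registered)
print(master_workers)
print(unlist(worker_registered))

stopCluster(cl)
```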

To use doSNOW for the outer and inner foreach loops, you could initialize the snow cluster workers using clusterCall:

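# Outer cluster: two workers, one per outer foreach task
cl <- makeCluster(2, type = "SOCK")
# On each outer worker, start and register its own two-worker inner cluster
clusterCall(cl, function() {
  library(doSNOW)
  registerDoSNOW(makeCluster(2, type = "SOCK"))
  NULL  # return nothing useful to the master
})
# Register the outer cluster on the master
registerDoSNOW(cl)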

This registers doSNOW for both the master and the workers so that each call to cv.glmnet will execute on a two-worker cluster when parallel=TRUE is specified.
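Putting it together with the data from the question, a complete run might look like the sketch below. It differs from the snippet above in one respect, which is my own addition: it uses clusterEvalQ so that each worker keeps its inner cluster in a global variable (inner_cl, a name chosen here for illustration) and can stop it explicitly at the end. It assumes glmnet and doSNOW are installed.

```r
library(glmnet)
library(doSNOW)

set.seed(1)
Npar <- 1000
Nobs <- 200
Xdat <- matrix(rnorm(Nobs * Npar), ncol = Npar)
Xclass <- rep(1:2, each = Nobs / 2)
Ydat <- rnorm(Nobs)

# Outer cluster: one worker per class.
cl <- makeCluster(2, type = "SOCK")

# clusterEvalQ evaluates in each worker's global environment,
# so inner_cl stays available there for later cleanup.
clusterEvalQ(cl, {
  library(doSNOW)
  inner_cl <- makeCluster(2, type = "SOCK")
  registerDoSNOW(inner_cl)
  NULL
})
registerDoSNOW(cl)

# Outer loop runs on cl; cv.glmnet's own foreach loop runs on inner_cl.
mods <- foreach(x = 1:2, .packages = "glmnet") %dopar% {
  idx <- Xclass == x
  cv.glmnet(Xdat[idx, ], Ydat[idx], nfolds = 4, parallel = TRUE)
}

# Stop the inner clusters first, then the outer one.
clusterEvalQ(cl, { stopCluster(inner_cl); NULL })
stopCluster(cl)
```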

The trick with nested parallelism is to avoid creating too many processes and oversubscribing the CPU (or CPUs), so you need to be careful when registering the parallel backends. My example makes sense for a CPU with four cores even though a total of six workers are created, since the "outer" workers don't do much while the inner foreach loops execute. It is common when running on a cluster to use doSNOW to start one worker per node, and then use doMC to start one worker per core on each of those nodes.
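The doSNOW-plus-doMC setup could be sketched roughly as follows. This assumes doMC is installed on every node (it is not available on Windows). The host names in nodes are placeholders: "localhost" is used twice so the sketch runs on a single machine, and you would substitute your real node names. The core count of 2 is likewise an assumption; on a real cluster you would pick the number of cores per node.

```r
library(doSNOW)

# Placeholder host names: replace with the nodes of your cluster.
nodes <- c("localhost", "localhost")
cores_per_node <- 2  # assumed value, adjust to your hardware

# One snow worker per node for the outer foreach loop.
cl <- makeCluster(nodes, type = "SOCK")

# On each node, register doMC so inner foreach loops fork across local cores.
clusterCall(cl, function(n) {
  library(doMC)
  registerDoMC(cores = n)
  NULL
}, cores_per_node)

registerDoSNOW(cl)

# Confirm what each node will use for its inner loops.
inner_workers <- unlist(clusterCall(cl, function() foreach::getDoParWorkers()))
print(inner_workers)

stopCluster(cl)
```

Because doMC forks processes on demand, there is no inner cluster to shut down; stopping the outer cluster is enough.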

Note that your example doesn't use much compute time, so it's not really worthwhile to use two levels of parallelism. I would use a much bigger problem in order to determine the benefits of the different approaches.

Steve Weston answered Oct 23 '22 23:10