
Seed and clusterApply - how to select a specific run?

I am performing k-means on a large dataset (636,688 rows x 7 columns) and have therefore turned to parallelization. My results need to be reproducible. I can achieve this with the clusterSetRNGStream function from the parallel package. Here is an example using the Boston dataset from the MASS library:

library(parallel)
cl <- makeCluster(detectCores())
clusterSetRNGStream(cl, iseed = 1234)
clusterEvalQ(cl, library(MASS))
results <- clusterApply(cl, rep(25, 4), function(nstart) kmeans(Boston, 4, nstart = nstart))
check.results <- sapply(results, function(result) result$size)
stopCluster(cl)

Each column of check.results holds the cluster sizes (the number of observations in each cluster) for one run of the k-means algorithm. My check.results then looks like this:

     [,1] [,2] [,3] [,4]
[1,]   38  268  102  102
[2,]  268   98   98   38
[3,]   98  102   38  268
[4,]  102   38  268   98

If I change the clusterApply call to use rep(25, 2) instead of rep(25, 4), I get:

     [,1] [,2]
[1,]   38  268
[2,]  268   98
[3,]   98  102
[4,]  102   38

Perfect - the sizes for the first 2 runs remain the same, regardless of whether I run 4 iterations or only 2. If you continue changing the number of iterations, you will see that each individual run remains the same.

My question - how can I pick out e.g. the 4th run specifically without having to run the first 3 runs? Are there specific seeds saved under the underlying iseed in clusterSetRNGStream?

Anna Dunietz asked Feb 04 '14 18:02


1 Answer

The clusterSetRNGStream function doesn't support the kind of reproducibility that you want very well. It simply initializes each cluster worker to draw from a different stream of random numbers, so the results are reproducible when you use clusterApply with a given number of workers. To re-execute one particular task, however, you would have to run it on the same worker in order to get the correct stream, and then fast-forward in that stream to the point where the task originally started. Fast-forwarding isn't supported, even if you know exactly how many random numbers each earlier task consumed.
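The worker dependence described above can be seen in a small experiment. This is only an illustrative sketch: the helper names draw and task are made up for the example, and runif stands in for kmeans as a cheap consumer of random numbers. The same two tasks are run with the same iseed on a two-worker and a one-worker cluster.

```r
library(parallel)

# Hypothetical stand-in task: consume three uniform random numbers.
task <- function(i) runif(3)

# Hypothetical helper: run two tasks on a freshly seeded cluster
# with the given number of workers.
draw <- function(nworkers) {
  cl <- makeCluster(nworkers)
  on.exit(stopCluster(cl))
  clusterSetRNGStream(cl, iseed = 1234)
  clusterApply(cl, 1:2, task)
}

two_workers <- draw(2)  # task 2 runs on worker 2 and starts that worker's stream
one_worker  <- draw(1)  # task 2 runs on worker 1 and continues worker 1's stream

# Task 1 starts at the beginning of the first stream in both setups.
identical(two_workers[[1]], one_worker[[1]])
# Task 2 draws from a different stream depending on the worker count.
identical(two_workers[[2]], one_worker[[2]])
```

The first comparison should come out equal and the second should not, which is why a single task cannot be replayed in isolation with this approach.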

Instead, I suggest that you use the lower level functions to assign a different substream of random numbers to each task. You can do that by generating the task seeds using the nextRNGSubStream function:

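library(parallel)
# This is based on the clusterSetRNGStream function from
# the parallel package, copyrighted by The R Core Team
getseeds <- function(ntasks, iseed) {
  # L'Ecuyer-CMRG is the generator that supports streams and substreams.
  # Note that this changes the RNG kind of the calling session.
  RNGkind("L'Ecuyer-CMRG")
  set.seed(iseed)
  seeds <- vector("list", ntasks)
  # The first task starts from the state produced by set.seed(iseed)
  seeds[[1]] <- .Random.seed
  # Each later task gets the substream that follows the previous task's
  for (i in seq_len(ntasks - 1)) {
    seeds[[i + 1]] <- nextRNGSubStream(seeds[[i]])
  }
  seeds
}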
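As a quick sanity check, the seeds can be inspected on the master before any cluster is involved. The sketch below repeats the answer's getseeds so that it runs on its own; the names seeds4 and seeds2 are only illustrative. It checks that the seed for task i depends on iseed and i alone, not on how many tasks are requested.

```r
library(parallel)

# Same definition as in the answer, repeated so this snippet is self-contained.
getseeds <- function(ntasks, iseed) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(iseed)
  seeds <- vector("list", ntasks)
  seeds[[1]] <- .Random.seed
  for (i in seq_len(ntasks - 1)) {
    seeds[[i + 1]] <- nextRNGSubStream(seeds[[i]])
  }
  seeds
}

seeds4 <- getseeds(4, 1234)
seeds2 <- getseeds(2, 1234)

# Each seed is a complete L'Ecuyer-CMRG state: an integer vector of length 7.
lengths(seeds4)

# Asking for fewer tasks yields a prefix of the longer seed list.
identical(seeds4[1:2], seeds2)

# No two tasks share a seed (anyDuplicated returns 0 when all differ).
anyDuplicated(seeds4)
```

Because the list is deterministic, the seed for the fourth task can be regenerated at any time from iseed without running tasks one to three.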

Since we're not using clusterSetRNGStream, you need to set the random number generator to "L'Ecuyer-CMRG" when initializing the workers:

cl <- makeCluster(detectCores())
clusterEvalQ(cl, { library(MASS); RNGkind("L'Ecuyer-CMRG") })

The key is to set the value of ".Random.seed" from the worker function in order to use the correct random number substream for each task:

worker <- function(nstart, seed, centers=4) {
  assign(".Random.seed", seed, envir=.GlobalEnv)
  kmeans(Boston, centers, nstart = nstart)
}

Since we're iterating over both the nstart and seed values, you use clusterMap rather than clusterApply to execute the tasks:

n <- 4
nstarts <- rep(25, n)
seeds <- getseeds(n, 1234)
results <- clusterMap(cl, worker, nstarts, seeds)

To reproduce the results of the fourth task, you specify the fourth seed:

itasks <- c(4)
results <- clusterMap(cl, worker, nstarts[itasks], seeds[itasks])

With this method, you get reproducible results even when load balancing via the clusterMap argument .scheduling="dynamic", because each result depends only on its task's seed and not on which worker executes the task. With clusterSetRNGStream, by contrast, the result depends on the worker.
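Both claims (rerunning only the fourth task, and independence from scheduling) can be checked end to end. The following is a sketch under a few assumptions: the MASS package is installed, a two-worker cluster is enough for the demonstration, and same_fit is a made-up helper that compares only the cluster assignments and centers of two kmeans fits.

```r
library(parallel)

# Definitions from the answer, repeated so this snippet is self-contained.
getseeds <- function(ntasks, iseed) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(iseed)
  seeds <- vector("list", ntasks)
  seeds[[1]] <- .Random.seed
  for (i in seq_len(ntasks - 1)) {
    seeds[[i + 1]] <- nextRNGSubStream(seeds[[i]])
  }
  seeds
}

worker <- function(nstart, seed, centers = 4) {
  assign(".Random.seed", seed, envir = .GlobalEnv)
  kmeans(Boston, centers, nstart = nstart)
}

cl <- makeCluster(2)  # assumption: two workers, fewer than the number of tasks
clusterEvalQ(cl, { library(MASS); RNGkind("L'Ecuyer-CMRG") })

n <- 4
nstarts <- rep(25, n)
seeds <- getseeds(n, 1234)

full    <- clusterMap(cl, worker, nstarts, seeds)        # all four tasks
fourth  <- clusterMap(cl, worker, nstarts[4], seeds[4])  # only the fourth task
dynamic <- clusterMap(cl, worker, nstarts, seeds, .scheduling = "dynamic")
stopCluster(cl)

# Hypothetical helper: two fits match if assignments and centers are identical.
same_fit <- function(a, b) {
  identical(a$cluster, b$cluster) && identical(a$centers, b$centers)
}

same_fit(full[[4]], fourth[[1]])      # rerun of task 4 vs. the full run
all(mapply(same_fit, full, dynamic))  # static vs. dynamic scheduling
```

Both comparisons should report a match, since every task resets .Random.seed to its own substream before calling kmeans.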


Note that you can use the clusterMap MoreArgs argument to specify a value for the centers argument of the worker function:

results <- clusterMap(cl, worker, nstarts, seeds, MoreArgs=list(centers=5))

Steve Weston answered Nov 01 '22 08:11