I am performing k-means on a large dataset (636,688 rows x 7 columns) and have therefore turned to parallelization. My results need to be reproducible. I can do this using the clusterSetRNGStream function from the parallel package. Here is an example using the Boston dataset from the MASS library:
library(parallel)
cl <- makeCluster(detectCores())
clusterSetRNGStream(cl, iseed = 1234)
clusterEvalQ(cl, library(MASS))
results <- clusterApply(cl, rep(25, 4), function(nstart) kmeans(Boston, 4, nstart = nstart))
check.results <- sapply(results, function(result) result$size)
stopCluster(cl)
Each column of check.results holds the cluster sizes (number of observations per cluster) for one run of the k-means algorithm. My check.results then looks like this:
     [,1] [,2] [,3] [,4]
[1,]   38  268  102  102
[2,]  268   98   98   38
[3,]   98  102   38  268
[4,]  102   38  268   98
If I change my results variable to use rep(25, 2) instead of rep(25, 4), I get:
     [,1] [,2]
[1,]   38  268
[2,]  268   98
[3,]   98  102
[4,]  102   38
Perfect - the first 2 runs' sizes remain the same, regardless of me running 4 iterations or only 2. If you continue changing the number of iterations, you will see that each individual run remains the same.
My question - how can I pick out e.g. the 4th run specifically without having to run the first 3 runs? Are there specific seeds saved under the underlying iseed in clusterSetRNGStream?
The clusterSetRNGStream function doesn't support the kind of reproducibility that you want very well. It simply initializes each cluster worker to draw from a different stream of random numbers, which makes results reproducible when using clusterApply with a given number of workers. But to re-execute a particular task, you would have to run it on the same worker to get the correct stream, and then fast-forward within that stream to the point where the task started. That isn't supported, even if you know exactly how many random numbers each earlier task consumed.
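To make the worker dependence concrete, here is a minimal sketch that is not part of the original example. The helper names draws and draw_one are made up for illustration, and runif stands in for kmeans so the effect is easy to see. It seeds a one-worker and a two-worker cluster identically and compares what each task draws:

```r
library(parallel)

# Hypothetical task: draw a single uniform random number.
draw_one <- function(i) runif(1)

# Hypothetical helper: run `ntasks` tasks on a cluster of `nworkers` workers,
# seeded with clusterSetRNGStream as in the question.
draws <- function(nworkers, ntasks = 2, iseed = 1234) {
  cl <- makeCluster(nworkers)
  on.exit(stopCluster(cl))
  clusterSetRNGStream(cl, iseed = iseed)
  unlist(clusterApply(cl, seq_len(ntasks), draw_one))
}

one_worker  <- draws(1)  # both tasks draw from worker 1's stream
two_workers <- draws(2)  # task 2 runs on worker 2, which has its own stream

# Task 1 starts from the same stream in both setups; task 2 does not.
same_first  <- identical(one_worker[1], two_workers[1])
same_second <- identical(one_worker[2], two_workers[2])
print(c(same_first = same_first, same_second = same_second))
```

The second task's value changes with the number of workers because it is tied to whichever worker happens to run it, not to the task itself.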
Instead, I suggest that you use the lower level functions to assign a different substream of random numbers to each task. You can do that by generating the task seeds using the nextRNGSubStream function:
library(parallel)
# This is based on the clusterSetRNGStream function from
# the parallel package, copyrighted by The R Core Team
getseeds <- function(ntasks, iseed) {
  # Note: this switches the master's RNG kind and resets its seed
  RNGkind("L'Ecuyer-CMRG")
  set.seed(iseed)
  seeds <- vector("list", ntasks)
  seeds[[1]] <- .Random.seed
  # Each task starts at the substream following the previous task's seed
  for (i in seq_len(ntasks - 1)) {
    seeds[[i + 1]] <- nextRNGSubStream(seeds[[i]])
  }
  seeds
}
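As a quick sanity check (a sketch, not part of the original answer), the snippet below repeats the getseeds helper so it runs on its own and confirms two properties the approach relies on: the seed for task i does not depend on how many tasks are requested, and every task gets a distinct seed. The variable names are illustrative only.

```r
library(parallel)

# Same helper as above, repeated so this snippet is self-contained.
getseeds <- function(ntasks, iseed) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(iseed)
  seeds <- vector("list", ntasks)
  seeds[[1]] <- .Random.seed
  for (i in seq_len(ntasks - 1)) {
    seeds[[i + 1]] <- nextRNGSubStream(seeds[[i]])
  }
  seeds
}

seeds4 <- getseeds(4, 1234)
seeds2 <- getseeds(2, 1234)

# The first two seeds are the same whether 2 or 4 tasks were requested.
prefix_stable <- identical(seeds4[1:2], seeds2)

# No two tasks share a starting seed.
all_distinct <- anyDuplicated(seeds4) == 0

print(c(prefix_stable = prefix_stable, all_distinct = all_distinct))
```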
Since we're not using clusterSetRNGStream, you need to set the random number generator to "L'Ecuyer-CMRG" when initializing the workers:
cl <- makeCluster(detectCores())
clusterEvalQ(cl, { library(MASS); RNGkind("L'Ecuyer-CMRG") })
The key is to set the value of ".Random.seed" from the worker function in order to use the correct random number substream for each task:
worker <- function(nstart, seed, centers=4) {
  assign(".Random.seed", seed, envir=.GlobalEnv)
  kmeans(Boston, centers, nstart = nstart)
}
Since we're iterating over both the nstart and seed values, you use clusterMap rather than clusterApply to execute the tasks:
n <- 4
nstarts <- rep(25, n)
seeds <- getseeds(n, 1234)
results <- clusterMap(cl, worker, nstarts, seeds)
To reproduce the results of the fourth task, you specify the fourth seed:
itasks <- c(4)
results <- clusterMap(cl, worker, nstarts[itasks], seeds[itasks])
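The following end-to-end sketch checks that claim. It is an illustration under a few assumptions rather than part of the original answer: it uses two workers instead of detectCores(), refers to the data as MASS::Boston so the workers need no library call, and the names all_runs and run4 are made up here.

```r
library(parallel)

# Helper and worker as defined above, repeated so the snippet is self-contained.
getseeds <- function(ntasks, iseed) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(iseed)
  seeds <- vector("list", ntasks)
  seeds[[1]] <- .Random.seed
  for (i in seq_len(ntasks - 1)) {
    seeds[[i + 1]] <- nextRNGSubStream(seeds[[i]])
  }
  seeds
}

worker <- function(nstart, seed, centers = 4) {
  assign(".Random.seed", seed, envir = .GlobalEnv)
  kmeans(MASS::Boston, centers, nstart = nstart)
}

cl <- makeCluster(2)  # assumption: two workers are enough for the check
invisible(clusterEvalQ(cl, RNGkind("L'Ecuyer-CMRG")))

n <- 4
nstarts <- rep(25, n)
seeds <- getseeds(n, 1234)

# Run all four tasks, then rerun only the fourth one.
all_runs <- clusterMap(cl, worker, nstarts, seeds)
run4 <- clusterMap(cl, worker, nstarts[4], seeds[4])[[1]]
stopCluster(cl)

# The rerun should match the fourth task of the full run.
same_size    <- identical(all_runs[[4]]$size, run4$size)
same_cluster <- identical(all_runs[[4]]$cluster, run4$cluster)
print(c(same_size = same_size, same_cluster = same_cluster))
```

Because the seed travels with the task, the rerun does not need to land on the worker that handled it in the full run.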
Using this method, you get reproducible results even when load balancing via the clusterMap .scheduling="dynamic" argument, since the results don't depend on which worker executes the task, as they do when using clusterSetRNGStream.
Note that you can use the clusterMap MoreArgs argument to specify a value for the centers argument of the worker function:
results <- clusterMap(cl, worker, nstarts, seeds, MoreArgs=list(centers=5))