Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Resetting R random number generator (rlecuyer) for inner loops using Snow/doSNOW

I have an outer foreach/dopar parallel loop containing an inner loop. Every instance of the inner loop should work on the same set of random numbers. The rest, i.e. the remaining parts of the outer body and the parallel instances should work as usual, i.e. with independent random numbers.

I can achieve this in a non-parallel implementation by saving the state of the RNG before the start of the inner loop and restoring this state after execution of each instance of the inner loop. See the following example:

library(doSNOW)

seed = 4711

cl = makeCluster(2)
registerDoSNOW(cl)
clusterSetupRNGstream (cl, seed=rep(seed,6))

erg = foreach(irun = 1:3,.combine = rbind) %dopar% {

  #do some random stuff in outer loop
  smp = runif(1)

  # save current state of RNG
  s = .Random.seed

  # inner loop, does some more random stuff
  idx = numeric(5)
  for(ii in seq.int(5)) {
    idx[ii] = sample.int(10, 1)
    # reset RNG for next loop iteration
    set.seed(s)
  }

  c(smp,idx)
}

> print(erg)
              [,1] [,2] [,3] [,4] [,5] [,6]
result.1 0.5749162    7    6    2    3    7
result.2 0.1208910    4    3    6    8    9
result.3 0.3491315    7    2    7    6   10

My desired output were constant integers along each row, different from row to row. So this does not work in parallel. The reason is quite clear: snow uses a different random generator (rlecuyer) and has to deal with parallel streams.

The question is: How can I achieve this reset of seed(s) in snow with the random number generator rlecuyer? One of the the challenges for me is to identify the proper substream one is in when the seed should be reset.

My current work-around is to pre-calculate all random stuff (in the example the idx vector) once for the inner loop and then use this constant data in all inner instances. This is not optimal since the random data in total becomes very large and it is much better to (re)generate it on the fly in smaller chunks.

Edit

It is of utmost importance that any setting/resetting keeps the parallel streams independent. As I currently understand the mechanism this requires:

For and within each parallel process:

  1. Detect the relevant substream the process is working on.
  2. At convenient times: Find and store the random seed of THIS substream.
  3. At convenient times: Reset substream from 1) with seed from 2) and leave all other streams undisturbed.
like image 641
g g Avatar asked Oct 15 '22 20:10

g g


1 Answers

There are a few things worth mentioning here. First of all, you're saving .Random.seed and then passing it directly to set.seed, which wouldn't give you the desired result even if you weren't having problems with the RNG. The reason for this is that, as documented, set.seed's seed argument only cares about a single integer, so if you pass a vector, it will still use only the 1st value, and the documentation of .Random.seed states:

.Random.seed is an integer vector whose first element codes the kind of RNG and normal generator.

So all your parallel streams would end up with the same integers because the first value of .Random.seed is always the same for the same RNG kind.

Secondly, the documentation also states:

.Random.seed saves the seed set for the uniform random-number generator, at least for the system generators. It does not necessarily save the state of other generators, and in particular does not save the state of the Box–Muller normal generator. If you want to reproduce work later, call set.seed (preferably with explicit values for kind and normal.kind) rather than set .Random.seed.

Which means you should definitely know what RNG you're using if you're going to be manipulating more internal values. If you call clusterEvalQ(cl, RNGkind()) after your call to clusterSetupRNGstream, you'll see that it returns:

[1] "user-supplied" "Inversion"     "Rejection"

So you can't assume that .Random.seed will be enough to save the RNG's state. In fact, I'm not even sure it plays nicely with doSNOW, see this discrepancy:

clusterSetupRNGstream(cl, seed = rep(seed, 6))
foo = foreach(irun = 1:2, .combine = list) %dopar% {
  list(
    .Random.seed,
    get(".Random.seed", .GlobalEnv)
  )
}
> str(foo)
List of 2
 $ :List of 2
  ..$ : int [1:626] 10403 1 -921191862 -372998484 563067311 -15494811 985677596 1278354290 1696669677 -1382461401 ...
  ..$ : int 10405
 $ :List of 2
  ..$ : int [1:626] 10403 1 -921191862 -372998484 563067311 -15494811 985677596 1278354290 1696669677 -1382461401 ...
  ..$ : int 10405

Finally, it's clear from your example that set.seed is not really working in this scenario. The documentation of clusterSetupRNGstream states that the rlecuyer package is used, and I don't know enough details about that package to say whether it supports set.seed or not, but I suppose it doesn't.

The only alternative I can give you is one that I've used before that is a bit more verbose:

RNGkind("L'Ecuyer-CMRG")

cl = makeCluster(2)
registerDoSNOW(cl)

seed = 4711L
set.seed(seed)
# x is a vector of length irun - 1
seeds = Reduce(x = 1:2, init = .Random.seed, accumulate = TRUE, f = function(x, ignored) {
  parallel::nextRNGStream(x)
})

erg = foreach(pseed = seeds, .combine = rbind) %dopar% {
  RNGkind("L'Ecuyer-CMRG")
  assign(".Random.seed", pseed, .GlobalEnv)

  # do some random stuff in outer loop
  smp = runif(1)
  
  # save current state of RNG
  s = get(".Random.seed", .GlobalEnv)
  
  # inner loop, does some more random stuff
  idx = numeric(5)
  for(ii in seq.int(5)) {
    idx[ii] = sample.int(10, 1)
    # reset RNG for next loop iteration
    assign(".Random.seed", s, .GlobalEnv)
  }
  
  c(smp, idx)
}

> print(erg)
              [,1] [,2] [,3] [,4] [,5] [,6]
result.1 0.9482966    2    2    2    2    2
result.2 0.1749918    3    3    3    3    3
result.3 0.3263343    1    1    1    1    1

AFAIK, the state of R's own L'Ecuyer-CMRG is indeed held in .Random.seed.

like image 125
Alexis Avatar answered Oct 20 '22 16:10

Alexis