Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Setting seed for nested parallel simulation in R and storing the states of the random-number generator

In R, I am parallelising my simulation using packages foreach, doFuture, and doRNG.
I have two nested foreach loop: the inner loop generate and analyse data for each iteration, and the outer loop repeats that procedure for different scenarios of parameter values. A simple example of my code structure is as follows:

library(foreach)
library(doFuture)
library(doRNG)
library(data.table)

plan(multisession, workers = 2)
registerDoRNG()
set.seed(123)
scenarios <- data.table(x = c(0.1, 0.2, 0.3), scenario = 1:3)

out_dir <- getwd()

results <- foreach(s = 1:3, .combine = rbind) %do% {
  this_x <- scenarios$x[s]
  scen_id <- scenarios$scenario[s]
  
  sim_res <- foreach(i = 1:100, .combine = rbind,
                     .options.future = list(seed = TRUE)) %dofuture% {
           y <- rnorm(1, this_x)
           data.frame(rep = i, y = y)
              
  }
  out_file <- file.path(out_dir, paste0("scenario_", scen_id, ".csv"))
  fwrite(sim_res, out_file)
  
    summary <- data.frame(scenario = scen_id, mean_y = mean(sim_res$y))  
                        summary
}

print(results)

While the registerDoRNG() function ensures safe reproducibility for parallel session, I have two issues that I would like your help:

1. The results of each scenario depends on the order that the scenario is run. So e.g. the results of scenario 1 and 3 when I use foreach(s = 1:3...) will be different from those when I use foreach(s = 3:1...)

2. I have not figured out how to store the state of the random-number generator at each iteration. This would be helpful if for example my R crash in the middle of the simulation and I want to start from where I left off, or if I want to investigate the results of a particular iteration.

I thought of setting a different seed for each iteration, like in here, but I read from here that this approach does not guarantee high-quality random numbers.

like image 247
Trang Hien Avatar asked Aug 31 '25 03:08

Trang Hien


1 Answers

For the record, from the second link you posted, the "not good" use of

seeds <- sample.int(n = 10)
y <- parLapply(cl, seeds, function(seed) {
  set.seed(seed)
  runif(n = 5)
})

might be "not good" because in this case it only seeds with integers 1 to 10:

sample.int(n=10)
#  [1]  7  5 10  2  9  8  4  3  1  6

I agree with the assessment that that is not a good approach to controlling random seeds.

In my slightly different approach, we pull from a much larger set of seeds,

sample.int(.Machine$integer.max, size=10)
#  [1]  397286749 1052135137 1011271536  329404477 1906443373 1687197862 2021799661 1642042756 2121729063 1740460742

which will support a much broader surface of randomness. Since it relies on integers for seeding the random pool, it's certainly "ideally a little less random" than a non-seeded approach, but .Machine$integer.max is very large and supports the vast majority of what one might need.


I suggest we can use this for a very robust method that allows you to store the seed in your results.

The most detailed resolution would be to set a seed for each inner loop:

outerloop <- 3; innerloop <- 100 # for clarity/reference here
seeds <- replicate(outerloop,
                   sample.int(.Machine$integer.max, size = innerloop),
                   simplify = FALSE)
results <- foreach(s = 1:outerloop, .combine = rbind) %do% {
  # ...
  sim_res <- foreach(i = 1:innerloop, .combine = rbind,
                     .options.future = list(seed = TRUE)) %dofuture% {
           set.seed(seeds[[s]][i])
           # ...
  }
  # ...
}

I think that's what you're targeting, but in case you only want/need to set it in the outer loop, then it can be a bit simpler:

seeds <- sample.int(.Machine$integer.max, size = outerloop)
results <- foreach(s = 1:innerloop, .combine = rbind) %do% {
  set.seed(seeds[s])
  # ...
}

You can choose how/where to embed the seeds[s] within results.


Another approach is to use sample.int(.) inside each simulation and accompany its value in the output; for frames, this might be as an attribute (perhaps fragile) or as an invariant column (inefficient but rugged).

results <- foreach(s = 1:outerloop, .combine = rbind) %do% {
  # ...
  sim_res <- foreach(i = 1:innerloop, .combine = rbind,
                     .options.future = list(seed = TRUE)) %dofuture% {
           seed <- sample.int(.Machine$integer.max, size=1)
           res <- ...
           ### store seed as an attribute of the frame ...
           attr(res, "seed") <- seed
           ### ... or as a column
           res$seed <- seed
           res
  }
  # ...
}

This works in effectively the same way, though I often prefer the pre-pulled seeds for two reasons:

  • I find storing random seeds separately to be slightly safer: if I want to check that a change to my code fixed a bug, I cannot do it unless I have control over the seeds going in;

  • if you're doing it many many times, it can be measurably slower to do it one at a time, in the below case it's 183x slower. Generally I won't worry about this premature optimization unless (a) the inner calc is small, and (b) I'm doing well over 1e6 iterations.

bench::mark(
  `sample-1` = replicate(1e5, sample.int(.Machine$integer.max, size=1)), 
  `sample-n` = sample.int(.Machine$integer.max, size=1e5),
  check = FALSE)
# Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# # A tibble: 2 × 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory               time             gc                
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>               <list>           <list>            
# 1 sample-1   296.94ms 315.05ms      3.17    3.29MB     20.6     2    13      630ms <NULL> <Rprofmem [597 × 3]> <bench_tm [2]>   <tibble [2 × 3]>  
# 2 sample-n     1.77ms   1.82ms    549.      1.38MB      0     275     0      501ms <NULL> <Rprofmem [2 × 3]>   <bench_tm [275]> <tibble [275 × 3]>

The biggest risk with this is rerunning the outer/inner loops without updating seeds for the purpose of assessing consistent variance ... but it'll be the same variability.

If your problem is that once in a million sims you want to be able to rerun one single run, and if calc-time is slow (so the cost of a single-pull is negligible), then sampling inside the inner loop is the quickest to code and mitigate the risk of not changing seeds.


Another option to just "save the state" is to save saved <- .Random.seed inside each inner loop, and return it with the data, either as an attribute or a list(.Random.seed, results). Clearly that won't work for saving to a .csv file, but if you are open-minded about that, then you can:

  • save a .rds or .rda file that includes the results and the seed, though this is specific to R;
  • save to a .parquet file with the seed as an attribute, preserved in most cases on reading it, though admittedly I don't know the ease of accessing frame attributes when reading in python (e.g.)
  • save the data to .csv and the saved-seed(s) to separate files, perhaps the latter to a per-sim or per-outerloop or per-innerloop .rds file
like image 173
r2evans Avatar answered Sep 02 '25 18:09

r2evans