In R, I am parallelising my simulation using packages foreach
, doFuture
, and doRNG
.
I have two nested foreach
loop: the inner loop generate and analyse data for each iteration, and the outer loop repeats that procedure for different scenarios of parameter values. A simple example of my code structure is as follows:
library(foreach)
library(doFuture)
library(doRNG)
library(data.table)
plan(multisession, workers = 2)
registerDoRNG()
set.seed(123)
scenarios <- data.table(x = c(0.1, 0.2, 0.3), scenario = 1:3)
out_dir <- getwd()
results <- foreach(s = 1:3, .combine = rbind) %do% {
this_x <- scenarios$x[s]
scen_id <- scenarios$scenario[s]
sim_res <- foreach(i = 1:100, .combine = rbind,
.options.future = list(seed = TRUE)) %dofuture% {
y <- rnorm(1, this_x)
data.frame(rep = i, y = y)
}
out_file <- file.path(out_dir, paste0("scenario_", scen_id, ".csv"))
fwrite(sim_res, out_file)
summary <- data.frame(scenario = scen_id, mean_y = mean(sim_res$y))
summary
}
print(results)
While the registerDoRNG()
function ensures safe reproducibility for parallel session, I have two issues that I would like your help:
1. The results of each scenario depends on the order that the scenario is run.
So e.g. the results of scenario 1 and 3 when I use foreach(s = 1:3...)
will be different from those when I use foreach(s = 3:1...)
2. I have not figured out how to store the state of the random-number generator at each iteration. This would be helpful if for example my R crash in the middle of the simulation and I want to start from where I left off, or if I want to investigate the results of a particular iteration.
I thought of setting a different seed for each iteration, like in here, but I read from here that this approach does not guarantee high-quality random numbers.
For the record, from the second link you posted, the "not good" use of
seeds <- sample.int(n = 10)
y <- parLapply(cl, seeds, function(seed) {
set.seed(seed)
runif(n = 5)
})
might be "not good" because in this case it only seeds with integers 1 to 10:
sample.int(n=10)
# [1] 7 5 10 2 9 8 4 3 1 6
I agree with the assessment that that is not a good approach to controlling random seeds.
In my slightly different approach, we pull from a much larger set of seeds,
sample.int(.Machine$integer.max, size=10)
# [1] 397286749 1052135137 1011271536 329404477 1906443373 1687197862 2021799661 1642042756 2121729063 1740460742
which will support a much broader surface of randomness. Since it relies on integers for seeding the random pool, it's certainly "ideally a little less random" than a non-seeded approach, but .Machine$integer.max
is very large and supports the vast majority of what one might need.
I suggest we can use this for a very robust method that allows you to store the seed in your results.
The most detailed resolution would be to set a seed for each inner loop:
outerloop <- 3; innerloop <- 100 # for clarity/reference here
seeds <- replicate(outerloop,
sample.int(.Machine$integer.max, size = innerloop),
simplify = FALSE)
results <- foreach(s = 1:outerloop, .combine = rbind) %do% {
# ...
sim_res <- foreach(i = 1:innerloop, .combine = rbind,
.options.future = list(seed = TRUE)) %dofuture% {
set.seed(seeds[[s]][i])
# ...
}
# ...
}
I think that's what you're targeting, but in case you only want/need to set it in the outer loop, then it can be a bit simpler:
seeds <- sample.int(.Machine$integer.max, size = outerloop)
results <- foreach(s = 1:innerloop, .combine = rbind) %do% {
set.seed(seeds[s])
# ...
}
You can choose how/where to embed the seeds[s]
within results
.
Another approach is to use sample.int(.)
inside each simulation and accompany its value in the output; for frames, this might be as an attribute (perhaps fragile) or as an invariant column (inefficient but rugged).
results <- foreach(s = 1:outerloop, .combine = rbind) %do% {
# ...
sim_res <- foreach(i = 1:innerloop, .combine = rbind,
.options.future = list(seed = TRUE)) %dofuture% {
seed <- sample.int(.Machine$integer.max, size=1)
res <- ...
### store seed as an attribute of the frame ...
attr(res, "seed") <- seed
### ... or as a column
res$seed <- seed
res
}
# ...
}
This works in effectively the same way, though I often prefer the pre-pulled seeds
for two reasons:
I find storing random seeds separately to be slightly safer: if I want to check that a change to my code fixed a bug, I cannot do it unless I have control over the seeds going in;
if you're doing it many many times, it can be measurably slower to do it one at a time, in the below case it's 183x slower. Generally I won't worry about this premature optimization unless (a) the inner calc is small, and (b) I'm doing well over 1e6 iterations.
bench::mark(
`sample-1` = replicate(1e5, sample.int(.Machine$integer.max, size=1)),
`sample-n` = sample.int(.Machine$integer.max, size=1e5),
check = FALSE)
# Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# # A tibble: 2 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 sample-1 296.94ms 315.05ms 3.17 3.29MB 20.6 2 13 630ms <NULL> <Rprofmem [597 × 3]> <bench_tm [2]> <tibble [2 × 3]>
# 2 sample-n 1.77ms 1.82ms 549. 1.38MB 0 275 0 501ms <NULL> <Rprofmem [2 × 3]> <bench_tm [275]> <tibble [275 × 3]>
The biggest risk with this is rerunning the outer/inner loops without updating seeds
for the purpose of assessing consistent variance ... but it'll be the same variability.
If your problem is that once in a million sims you want to be able to rerun one single run, and if calc-time is slow (so the cost of a single-pull is negligible), then sampling inside the inner loop is the quickest to code and mitigate the risk of not changing seeds
.
Another option to just "save the state" is to save saved <- .Random.seed
inside each inner loop, and return it with the data, either as an attribute or a list(.Random.seed, results)
. Clearly that won't work for saving to a .csv
file, but if you are open-minded about that, then you can:
.rds
or .rda
file that includes the results and the seed, though this is specific to R;.parquet
file with the seed as an attribute, preserved in most cases on reading it, though admittedly I don't know the ease of accessing frame attributes when reading in python (e.g.).csv
and the saved-seed(s) to separate files, perhaps the latter to a per-sim or per-outerloop or per-innerloop .rds
fileIf you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With