Is it possible to run a permutation-based function using mclapply in a reproducible way, regardless of the number of threads and the OS?
Below is a toy example. Hashing the resulting list of permuted vectors is just a convenient way to compare the results. I tried a different RNGkind ("L'Ecuyer-CMRG") and different settings for mc.preschedule and mc.set.seed. So far I have had no luck making the results identical.
library("parallel")
library("digest")
set.seed(1)
m <- mclapply(1:10, function(x) sample(1:10),
mc.cores=2, mc.set.seed = F)
digest(m, 'crc32')
set.seed(1)
m <- mclapply(1:10, function(x) sample(1:10),
mc.cores=4, mc.set.seed = F)
digest(m, 'crc32')
set.seed(1)
m <- mclapply(1:10, function(x) sample(1:10),
mc.cores=2, mc.set.seed = F)
digest(m, 'crc32')
set.seed(1)
m <- mclapply(1:10, function(x) sample(1:10),
mc.cores=1, mc.set.seed = F)
digest(m, 'crc32')
set.seed(1)
m <- lapply(1:10, function(x) sample(1:10))
digest(m, 'crc32') # this is equivalent to what I get on Windows.
sessionInfo() just in case:
> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] digest_0.6.8
loaded via a namespace (and not attached):
[1] tools_3.2.0
Another approach is to first generate the samples that you would like to use and call mclapply on the samples:
library("parallel")
library("digest")
input <- 1:10
set.seed(1)
nsamp <- 20
## Generate and store all the random samples
samples <- lapply(1:nsamp, function(x) sample(input))
## Apply the algorithm "diff" to every sample
ncore0 <- lapply(samples, diff)
ncore1 <- mclapply(samples, diff, mc.cores = 1)
ncore2 <- mclapply(samples, diff, mc.cores = 2)
ncore3 <- mclapply(samples, diff, mc.cores = 3)
ncore4 <- mclapply(samples, diff, mc.cores = 4)
## all equal
all.equal(ncore0,ncore1)
all.equal(ncore0,ncore2)
all.equal(ncore0,ncore3)
all.equal(ncore0,ncore4)
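This ensures reproducibility at the expense of more memory and a slightly longer running time, because the samples are generated serially and stored before the parallel step. The overhead is usually small, since the computation done on each sample is typically the most time-consuming operation.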
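If storing every sample is too costly, one possible variation (a sketch, not something tested in the question's setup) is to pre-generate only one integer seed per task and call set.seed() inside the worker. Each result then depends only on its own seed, not on which core runs the task. The names task_seeds, perm_task and n_cores are made up for illustration, and identical results across machines also assume the same R version and RNG settings.

```r
library("parallel")

input <- 1:10
nsamp <- 20

## One integer seed per task, drawn once in the master process
set.seed(1)
task_seeds <- sample.int(.Machine$integer.max, nsamp)

## Hypothetical worker: reseed, then draw the permutation and process it
perm_task <- function(seed) {
  set.seed(seed)
  diff(sample(input))
}

## mclapply() only supports mc.cores = 1 on Windows
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

res_serial   <- lapply(task_seeds, perm_task)
res_parallel <- mclapply(task_seeds, perm_task, mc.cores = n_cores)

## Compare the serial and parallel results
identical(res_serial, res_parallel)
```

Because the seed travels with the task, the number of cores and the scheduling mode should not affect the output. The trade-off is that the per-task streams are not guaranteed to be statistically independent the way L'Ecuyer-CMRG streams are.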
Note: Using mc.set.seed = FALSE as in your question makes every forked child start from the same RNG state, so each core generates the same samples, which is probably not what you want.
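A small check of this, assuming a fork-capable OS (Linux or macOS) and the default mc.preschedule = TRUE, where tasks are dealt out to the cores in round-robin order:

```r
library("parallel")

## Requires forking, so mc.cores = 2 will not work on Windows
set.seed(1)
m <- mclapply(1:4, function(x) sample(1:10),
              mc.cores = 2, mc.set.seed = FALSE)

## Tasks 1 and 3 run in the first child, tasks 2 and 4 in the second.
## Both children inherit the same RNG state from the parent session,
## so the first draw of each child should match, and likewise the second.
identical(m[[1]], m[[2]])
identical(m[[3]], m[[4]])
```

If both comparisons come back TRUE, only half of the permutations are actually distinct, which would bias any permutation-based statistic.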