How to run permutations using mclapply in a reproducible way regardless of number of threads and OS?

Tags: r

Is it possible to run a permutation-based function with mclapply in a reproducible way, regardless of the number of threads and the OS?
Below is a toy example. The resulting list of permuted vectors is hashed only to make the results easy to compare. I tried a different RNGkind ("L'Ecuyer-CMRG") and different settings for mc.preschedule and mc.set.seed. So far I have not managed to make all the results identical.

library("parallel")
library("digest")

set.seed(1)
m <- mclapply(1:10, function(x) sample(1:10),
              mc.cores=2, mc.set.seed = F)
digest(m, 'crc32')

set.seed(1)
m <- mclapply(1:10, function(x) sample(1:10),
              mc.cores=4, mc.set.seed = F)
digest(m, 'crc32')

set.seed(1)
m <- mclapply(1:10, function(x) sample(1:10),
              mc.cores=2, mc.set.seed = F)
digest(m, 'crc32')

set.seed(1)
m <- mclapply(1:10, function(x) sample(1:10),
              mc.cores=1, mc.set.seed = F)
digest(m, 'crc32')

set.seed(1)
m <- lapply(1:10, function(x) sample(1:10))
digest(m, 'crc32') # this is equivalent to what I get on Windows.

sessionInfo() just in case:

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] digest_0.6.8

loaded via a namespace (and not attached):
[1] tools_3.2.0

Vlad asked Nov 09 '22 12:11


1 Answer

Another approach is to first generate the samples that you would like to use and call mclapply on the samples:

    library("parallel")
    library("digest")

    input<-1:10
    set.seed(1)
    nsamp<-20
    ## Generate and store all the random samples
    samples<-lapply(1:nsamp, function(x){ sample(input) })

    ## apply the algorithm "diff" on every sample
    ncore0<-  lapply(samples, diff)
    ncore1<-mclapply(samples, diff, mc.cores=1)
    ncore2<-mclapply(samples, diff, mc.cores=2)
    ncore3<-mclapply(samples, diff, mc.cores=3)
    ncore4<-mclapply(samples, diff, mc.cores=4)

    ## all equal
    all.equal(ncore0,ncore1)
    all.equal(ncore0,ncore2)
    all.equal(ncore0,ncore3)
    all.equal(ncore0,ncore4)

This ensures reproducibility at the cost of more memory (every sample is stored up front) and a slightly longer running time (the samples are generated serially). The extra time is usually small, since the computation done on each sample is typically the most time-consuming operation.
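If storing every sample up front is too costly, a different technique (not the one used in the code above) is to give each task its own RNG stream, so the result depends only on the task index and not on which core runs it. The following is a minimal sketch, assuming the "L'Ecuyer-CMRG" generator and `nextRNGStream()` from the parallel package; the names `task_seeds`, `permute_task` and `n_cores` are hypothetical and chosen only for illustration.

```r
library("parallel")

# Assumption: L'Ecuyer-CMRG is used because nextRNGStream() requires it.
RNGkind("L'Ecuyer-CMRG")
set.seed(1)

n_tasks <- 10

# Pre-compute one stream seed per task (hypothetical helper list).
task_seeds <- vector("list", n_tasks)
s <- .Random.seed
for (i in seq_len(n_tasks)) {
  task_seeds[[i]] <- s
  s <- nextRNGStream(s)
}

# Each task installs its own stream before sampling, so the permutation
# depends only on the task index i.
permute_task <- function(i) {
  assign(".Random.seed", task_seeds[[i]], envir = globalenv())
  sample(1:10)
}

# mclapply cannot fork on Windows, so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

serial <- lapply(seq_len(n_tasks), permute_task)
forked <- mclapply(seq_len(n_tasks), permute_task, mc.cores = n_cores)

# Compare the serial and the parallel run.
print(identical(serial, forked))
```

Because every task resets the generator itself, the setting of mc.set.seed should not matter here. The trade-off is that the worker function has to manage the seed explicitly.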

Note: With mc.set.seed = F, as in your question, every forked child inherits the same RNG state from the parent, so each core generates the same samples, which is probably not what you want.
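The duplication can be checked with a small experiment. This is only a sketch, assuming a Unix-alike system (forking is not available on Windows) and the default mc.preschedule = TRUE, under which tasks are dealt out to the cores in round-robin order:

```r
library("parallel")

# Only meaningful where mclapply can fork (not on Windows).
if (.Platform$OS.type != "windows") {
  set.seed(1)
  m <- mclapply(1:4, function(x) sample(1:10),
                mc.cores = 2, mc.set.seed = FALSE)

  # With prescheduling, tasks 1 and 3 run on one child and tasks 2 and 4
  # on the other. Both children start from the parent's RNG state, so the
  # first permutation drawn on each child should coincide.
  print(identical(m[[1]], m[[2]]))
  print(identical(m[[3]], m[[4]]))
}
```

If both comparisons report that the vectors are identical, the ten permutations in the question contain far fewer distinct draws than intended.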

fishtank answered Nov 15 '22 06:11