We are facing a weird situation in our code when using R's runif
and setting seed with set.seed
with the kind = NULL
option (which resolves, unless I am mistaken, to kind = "default"
; the default being "Mersenne-Twister"
).
We set the seed using (8 digit) unique IDs generated by an upstream system, before calling runif
:
seeds = c(
"86548915", "86551615", "86566163", "86577411", "86584144",
"86584272", "86620568", "86724613", "86756002", "86768593", "86772411",
"86781516", "86794389", "86805854", "86814600", "86835092", "86874179",
"86876466", "86901193", "86987847", "86988080")
random_values = sapply(seeds, function(x) {
set.seed(x)
y = runif(1, 17, 26)
return(y)
})
This gives values that are extremely bunched together.
> summary(random_values)
Min. 1st Qu. Median Mean 3rd Qu. Max.
25.13 25.36 25.66 25.58 25.83 25.94
This behaviour of runif
goes away when we use kind = "Knuth-TAOCP-2002"
, and we get values that appear to be much more evenly spread out.
random_values = sapply(seeds, function(x) {
set.seed(x, kind = "Knuth-TAOCP-2002")
y = runif(1, 17, 26)
return(y)
})
Output omitted.
The most interesting thing here is that this does not happen on Windows -- only happens on Ubuntu (sessionInfo
output for Ubuntu & Windows below).
> seeds = c(
+ "86548915", "86551615", "86566163", "86577411", "86584144",
+ "86584272", "86620568", "86724613", "86756002", "86768593", "86772411",
+ "86781516", "86794389", "86805854", "86814600", "86835092", "86874179",
+ "86876466", "86901193", "86987847", "86988080")
>
> random_values = sapply(seeds, function(x) {
+ set.seed(x)
+ y = runif(1, 17, 26)
+ return(y)
+ })
>
> summary(random_values)
Min. 1st Qu. Median Mean 3rd Qu. Max.
17.32 20.14 23.00 22.17 24.07 25.90
Can someone help understand what is going on?
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8
[9] LC_ADDRESS=en_US.UTF-8 LC_TELEPHONE=en_US.UTF-8
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] RMySQL_0.10.8 DBI_0.6-1
[3] jsonlite_1.4 tidyjson_0.2.2
[5] optiRum_0.37.3 lubridate_1.6.0
[7] httr_1.2.1 gdata_2.18.0
[9] XLConnect_0.2-12 XLConnectJars_0.2-12
[11] data.table_1.10.4 stringr_1.2.0
[13] readxl_1.0.0 xlsx_0.5.7
[15] xlsxjars_0.6.1 rJava_0.9-8
[17] sqldf_0.4-10 RSQLite_1.1-2
[19] gsubfn_0.6-6 proto_1.0.0
[21] dplyr_0.5.0 purrr_0.2.4
[23] readr_1.1.1 tidyr_0.6.3
[25] tibble_1.3.0 tidyverse_1.1.1
[27] rBayesianOptimization_1.1.0 xgboost_0.6-4
[29] MLmetrics_1.1.1 caret_6.0-76
[31] ROCR_1.0-7 gplots_3.0.1
[33] effects_3.1-2 pROC_1.10.0
[35] pscl_1.4.9 lattice_0.20-35
[37] MASS_7.3-47 ggplot2_2.2.1
loaded via a namespace (and not attached):
[1] splines_3.4.0 foreach_1.4.3 AUC_0.3.0 modelr_0.1.0
[5] gtools_3.5.0 assertthat_0.2.0 stats4_3.4.0 cellranger_1.1.0
[9] quantreg_5.33 chron_2.3-50 digest_0.6.10 rvest_0.3.2
[13] minqa_1.2.4 colorspace_1.3-2 Matrix_1.2-10 plyr_1.8.4
[17] psych_1.7.3.21 XML_3.98-1.7 broom_0.4.2 SparseM_1.77
[21] haven_1.0.0 scales_0.4.1 lme4_1.1-13 MatrixModels_0.4-1
[25] mgcv_1.8-17 car_2.1-5 nnet_7.3-12 lazyeval_0.2.0
[29] pbkrtest_0.4-7 mnormt_1.5-5 magrittr_1.5 memoise_1.0.0
[33] nlme_3.1-131 forcats_0.2.0 xml2_1.1.1 foreign_0.8-69
[37] tools_3.4.0 hms_0.3 munsell_0.4.3 compiler_3.4.0
[41] caTools_1.17.1 rlang_0.1.1 grid_3.4.0 nloptr_1.0.4
[45] iterators_1.0.8 bitops_1.0-6 tcltk_3.4.0 gtable_0.2.0
[49] ModelMetrics_1.1.0 codetools_0.2-15 reshape2_1.4.2 R6_2.2.0
[53] knitr_1.15.1 KernSmooth_2.23-15 stringi_1.1.5 Rcpp_0.12.11
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_India.1252 LC_CTYPE=English_India.1252 LC_MONETARY=English_India.1252
[4] LC_NUMERIC=C LC_TIME=English_India.1252
attached base packages:
[1] graphics grDevices utils datasets grid stats methods base
other attached packages:
[1] bindrcpp_0.2 h2o_3.14.0.3 ggrepel_0.6.5 eulerr_1.1.0 VennDiagram_1.6.17
[6] futile.logger_1.4.3 scales_0.4.1 FinCal_0.6.3 xml2_1.0.0 httr_1.3.0
[11] wesanderson_0.3.2 wordcloud_2.5 RColorBrewer_1.1-2 htmltools_0.3.6 urltools_1.6.0
[16] timevis_0.4 dtplyr_0.0.1 magrittr_1.5 shiny_1.0.5 RODBC_1.3-14
[21] zoo_1.8-0 sqldf_0.4-10 RSQLite_1.1-2 gsubfn_0.6-6 proto_1.0.0
[26] gdata_2.17.0 stringr_1.2.0 XLConnect_0.2-12 XLConnectJars_0.2-12 data.table_1.10.4
[31] xlsx_0.5.7 xlsxjars_0.6.1 rJava_0.9-8 readxl_0.1.1 googlesheets_0.2.1
[36] jsonlite_1.5 tidyjson_0.2.1 RMySQL_0.10.9 RPostgreSQL_0.4-1 DBI_0.5-1
[41] dplyr_0.7.2 purrr_0.2.3 readr_1.1.1 tidyr_0.7.0 tibble_1.3.3
[46] ggplot2_2.2.0 tidyverse_1.0.0 lubridate_1.6.0
loaded via a namespace (and not attached):
[1] gtools_3.5.0 assertthat_0.2.0 triebeard_0.3.0 cellranger_1.1.0 yaml_2.1.14
[6] slam_0.1-40 lattice_0.20-34 glue_1.1.1 chron_2.3-48 digest_0.6.12.1
[11] colorspace_1.3-1 httpuv_1.3.5 plyr_1.8.4 pkgconfig_2.0.1 xtable_1.8-2
[16] lazyeval_0.2.0 mime_0.5 memoise_1.0.0 tools_3.3.2 hms_0.3
[21] munsell_0.4.3 lambda.r_1.1.9 rlang_0.1.1 RCurl_1.95-4.8 labeling_0.3
[26] bitops_1.0-6 tcltk_3.3.2 gtable_0.2.0 reshape2_1.4.2 R6_2.2.0
[31] bindr_0.1 futile.options_1.0.0 stringi_1.1.2 Rcpp_0.12.12.1
NOTE: This answer summarizes elements of a discussion of this question that took place on the R-devel mailing list. I've merely tried to capture and summarize the ideas originally articulated there.
Despite your assurances that these numbers are not a specially constructed edge case, they have every appearance of being just that. Here is the original sequence plus code to check the distribution of values produced:
seeds = c(
86548915, 86551615, 86566163, 86577411, 86584144, 86584272,
86620568, 86724613, 86756002, 86768593, 86772411, 86781516,
86794389, 86805854, 86814600, 86835092, 86874179, 86876466,
86901193, 86987847, 86988080)
checkit <- function(seeds) {
sapply(seeds, function(x) {
set.seed(x)
y = runif(1, 17, 26)
return(y)
})}
The original sequence shows surprisingly little variation, as noted:
summary(checkit(seeds+0))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
##25.13 25.36 25.66 25.58 25.83 25.94
There does appear to be something special about the original sequence, since minimal modifications to it do not produce the same surprising result:
summary(checkit(seeds+1))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.18 19.65 22.75 22.02 24.37 25.79
summary(checkit(seeds-1))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
##17.15 18.44 19.92 20.77 22.97 25.95
Out of the all the seeds in the range spanned by the original sequence, the expected number produce values in the observed range:
possible.seeds <- min(seeds):max(seeds)
s25 <- Filter(function(s){
set.seed(s)
x <- runif(1,17,26)
x > 25.12 & x < 25.95},
possible.seeds)
length(s25)/length(possible.seeds)
##[1] 0.09175801
And yet, all the values in the original sequence are in this subset (of course we knew this already...).
table(seeds %in% s25)
##TRUE
## 21
All of which points to the likely possibility that the original sequence was in fact a (perhaps unintentionally) specially constructed edge case.
As further evidence that your sequence is an edge-case, you can focus on the range of the putatively random values was constructed. The 17 and 26 are a bit of a distraction. Repeating your experiment with uniform on 0 and 1 generates something equally as unlikely:
f <- function(x) {
set.seed(x)
runif(1)
}
check_range <-function(seeds){
vals <- sapply(seeds,f)
max(vals)-min(vals)
}
When run against your seeds:
> check_range(seeds)
[1] 0.09026112
A reasonable model for check_range(seeds)
when run on 21 random seeds is that it is the sample range of a random sample of size 21 drawn U(0,1)
. Its theoretical density is given by:
f <- function(x){420*x^19*(1-x)}
We can use this to calculate the probability of observing a range of 0.09 or smaller:
> integrate(f,0,0.09)
2.334272e-20 with absolute error < 2.6e-34
As a check that it is reasonable to model the sample range when seeding the Mersenne Twister like that, you can do the following experiment:
ranges <- replicate(1000,check_range(sample(8548915:86988080,21)))
x <- seq(0,1,0.01)
y <- f(x)
hist(ranges,freq = FALSE,xlim =c(0,1))
points(x,y,type = "l")
abline(v=0.09)
Output:
The density histogram follows the theoretical density reasonably well. The set of 21 seeds from your question represents an extreme outlier. It is very unlikely to be due to chance and is also unlikely to be due to some underlying flaw in the Mersenne Twister. The most likely explanation is that the Mersenne Twister itself was involved in producing those 21 values (but not of course in the naïve way of simply using sample()
to draw 21 values).
When you're using Mersenne Twister with a single seed, it's reasonable to assume that the generated values are approximately independent and identically distributed. Unfortunately, there are no guarantees about the generated values from two streams started at different seeds. See, for example, this SC thread.
In your situation, I would recommend either using one of the seed selection strategies suggested in the SC thread, or else switching to a PRNG with better guarantees about parallel streams. One option is L'Ecuyer's "RngStreams" generator:
set.seed(0, kind = "L'Ecuyer-CMRG")
Even with that PRNG, I don't know if it still holds that you can seed the PRNG with arbitrary seeds and get roughly independent streams.
As far as the difference between Ubuntu and Windows, it's possible that one of these systems is using a 32-bit generator, and the other is using 64-bit.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With