Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: Extreme bunching of random values from runif with Mersenne-Twister seed

We are facing a weird situation in our code when using R's runif and setting seed with set.seed with the kind = NULL option (which resolves, unless I am mistaken, to kind = "default"; the default being "Mersenne-Twister").

We set the seed using (8 digit) unique IDs generated by an upstream system, before calling runif:

seeds = c(
  "86548915", "86551615", "86566163", "86577411", "86584144", 
  "86584272", "86620568", "86724613", "86756002", "86768593", "86772411", 
  "86781516", "86794389", "86805854", "86814600", "86835092", "86874179", 
  "86876466", "86901193", "86987847", "86988080")

random_values = sapply(seeds, function(x) {
  set.seed(x)
  y = runif(1, 17, 26)
  return(y)
})

This gives values that are extremely bunched together.

> summary(random_values)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  25.13   25.36   25.66   25.58   25.83   25.94 

This behaviour of runif goes away when we use kind = "Knuth-TAOCP-2002", and we get values that appear to be much more evenly spread out.

random_values = sapply(seeds, function(x) {
  set.seed(x, kind = "Knuth-TAOCP-2002")
  y = runif(1, 17, 26)
  return(y)
})

Output omitted.


The most interesting thing here is that this does not happen on Windows -- only happens on Ubuntu (sessionInfo output for Ubuntu & Windows below).

Windows output:

> seeds = c(
+   "86548915", "86551615", "86566163", "86577411", "86584144", 
+   "86584272", "86620568", "86724613", "86756002", "86768593", "86772411", 
+   "86781516", "86794389", "86805854", "86814600", "86835092", "86874179", 
+   "86876466", "86901193", "86987847", "86988080")
> 
> random_values = sapply(seeds, function(x) {
+   set.seed(x)
+   y = runif(1, 17, 26)
+   return(y)
+ })
> 
> summary(random_values)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  17.32   20.14   23.00   22.17   24.07   25.90 

Can someone help understand what is going on?

Ubuntu

R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
[1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                 
 [3] LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8       
 [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8      
 [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8          
 [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8     
[11] LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RMySQL_0.10.8               DBI_0.6-1                  
 [3] jsonlite_1.4                tidyjson_0.2.2             
 [5] optiRum_0.37.3              lubridate_1.6.0            
 [7] httr_1.2.1                  gdata_2.18.0               
 [9] XLConnect_0.2-12            XLConnectJars_0.2-12       
[11] data.table_1.10.4           stringr_1.2.0              
[13] readxl_1.0.0                xlsx_0.5.7                 
[15] xlsxjars_0.6.1              rJava_0.9-8                
[17] sqldf_0.4-10                RSQLite_1.1-2              
[19] gsubfn_0.6-6                proto_1.0.0                
[21] dplyr_0.5.0                 purrr_0.2.4                
[23] readr_1.1.1                 tidyr_0.6.3                
[25] tibble_1.3.0                tidyverse_1.1.1            
[27] rBayesianOptimization_1.1.0 xgboost_0.6-4              
[29] MLmetrics_1.1.1             caret_6.0-76               
[31] ROCR_1.0-7                  gplots_3.0.1               
[33] effects_3.1-2               pROC_1.10.0                
[35] pscl_1.4.9                  lattice_0.20-35            
[37] MASS_7.3-47                 ggplot2_2.2.1              

loaded via a namespace (and not attached):
[1] splines_3.4.0      foreach_1.4.3      AUC_0.3.0          modelr_0.1.0      
 [5] gtools_3.5.0       assertthat_0.2.0   stats4_3.4.0       cellranger_1.1.0  
 [9] quantreg_5.33      chron_2.3-50       digest_0.6.10      rvest_0.3.2       
[13] minqa_1.2.4        colorspace_1.3-2   Matrix_1.2-10      plyr_1.8.4        
[17] psych_1.7.3.21     XML_3.98-1.7       broom_0.4.2        SparseM_1.77      
[21] haven_1.0.0        scales_0.4.1       lme4_1.1-13        MatrixModels_0.4-1
[25] mgcv_1.8-17        car_2.1-5          nnet_7.3-12        lazyeval_0.2.0    
[29] pbkrtest_0.4-7     mnormt_1.5-5       magrittr_1.5       memoise_1.0.0     
[33] nlme_3.1-131       forcats_0.2.0      xml2_1.1.1         foreign_0.8-69    
[37] tools_3.4.0        hms_0.3            munsell_0.4.3      compiler_3.4.0    
[41] caTools_1.17.1     rlang_0.1.1        grid_3.4.0         nloptr_1.0.4      
[45] iterators_1.0.8    bitops_1.0-6       tcltk_3.4.0        gtable_0.2.0      
[49] ModelMetrics_1.1.0 codetools_0.2-15   reshape2_1.4.2     R6_2.2.0          
[53] knitr_1.15.1       KernSmooth_2.23-15 stringi_1.1.5      Rcpp_0.12.11  

Windows

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_India.1252  LC_CTYPE=English_India.1252    LC_MONETARY=English_India.1252
[4] LC_NUMERIC=C                   LC_TIME=English_India.1252    

attached base packages:
[1] graphics  grDevices utils     datasets  grid      stats     methods   base     

other attached packages:
 [1] bindrcpp_0.2         h2o_3.14.0.3         ggrepel_0.6.5        eulerr_1.1.0         VennDiagram_1.6.17  
 [6] futile.logger_1.4.3  scales_0.4.1         FinCal_0.6.3         xml2_1.0.0           httr_1.3.0          
[11] wesanderson_0.3.2    wordcloud_2.5        RColorBrewer_1.1-2   htmltools_0.3.6      urltools_1.6.0      
[16] timevis_0.4          dtplyr_0.0.1         magrittr_1.5         shiny_1.0.5          RODBC_1.3-14        
[21] zoo_1.8-0            sqldf_0.4-10         RSQLite_1.1-2        gsubfn_0.6-6         proto_1.0.0         
[26] gdata_2.17.0         stringr_1.2.0        XLConnect_0.2-12     XLConnectJars_0.2-12 data.table_1.10.4   
[31] xlsx_0.5.7           xlsxjars_0.6.1       rJava_0.9-8          readxl_0.1.1         googlesheets_0.2.1  
[36] jsonlite_1.5         tidyjson_0.2.1       RMySQL_0.10.9        RPostgreSQL_0.4-1    DBI_0.5-1           
[41] dplyr_0.7.2          purrr_0.2.3          readr_1.1.1          tidyr_0.7.0          tibble_1.3.3        
[46] ggplot2_2.2.0        tidyverse_1.0.0      lubridate_1.6.0     

loaded via a namespace (and not attached):
 [1] gtools_3.5.0         assertthat_0.2.0     triebeard_0.3.0      cellranger_1.1.0     yaml_2.1.14         
 [6] slam_0.1-40          lattice_0.20-34      glue_1.1.1           chron_2.3-48         digest_0.6.12.1     
[11] colorspace_1.3-1     httpuv_1.3.5         plyr_1.8.4           pkgconfig_2.0.1      xtable_1.8-2        
[16] lazyeval_0.2.0       mime_0.5             memoise_1.0.0        tools_3.3.2          hms_0.3             
[21] munsell_0.4.3        lambda.r_1.1.9       rlang_0.1.1          RCurl_1.95-4.8       labeling_0.3        
[26] bitops_1.0-6         tcltk_3.3.2          gtable_0.2.0         reshape2_1.4.2       R6_2.2.0            
[31] bindr_0.1            futile.options_1.0.0 stringi_1.1.2        Rcpp_0.12.12.1      
like image 985
tchakravarty Avatar asked Nov 02 '17 16:11

tchakravarty


3 Answers

NOTE: This answer summarizes elements of a discussion of this question that took place on the R-devel mailing list. I've merely tried to capture and summarize the ideas originally articulated there.

Despite your assurances that these numbers are not a specially constructed edge case, they have every appearance of being just that. Here is the original sequence plus code to check the distribution of values produced:

seeds = c(
    86548915, 86551615, 86566163, 86577411, 86584144, 86584272,
    86620568, 86724613, 86756002, 86768593, 86772411, 86781516,
    86794389, 86805854, 86814600, 86835092, 86874179, 86876466,
    86901193, 86987847, 86988080)
checkit <- function(seeds) {
    sapply(seeds, function(x) {
        set.seed(x)
        y = runif(1, 17, 26)
        return(y)
    })}

The original sequence shows surprisingly little variation, as noted:

  summary(checkit(seeds+0))
  ## Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  ##25.13   25.36   25.66   25.58   25.83   25.94 

There does appear to be something special about the original sequence, since minimal modifications to it do not produce the same surprising result:

summary(checkit(seeds+1))
## Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 17.18   19.65   22.75   22.02   24.37   25.79

summary(checkit(seeds-1))
## Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##17.15   18.44   19.92   20.77   22.97   25.95 

Out of the all the seeds in the range spanned by the original sequence, the expected number produce values in the observed range:

possible.seeds <- min(seeds):max(seeds)

s25 <- Filter(function(s){
    set.seed(s)
    x <- runif(1,17,26)
    x > 25.12 & x < 25.95},
    possible.seeds)

length(s25)/length(possible.seeds)
##[1] 0.09175801

And yet, all the values in the original sequence are in this subset (of course we knew this already...).

table(seeds %in% s25)

##TRUE 
##  21 

All of which points to the likely possibility that the original sequence was in fact a (perhaps unintentionally) specially constructed edge case.

like image 160
Ista Avatar answered Nov 07 '22 01:11

Ista


As further evidence that your sequence is an edge-case, you can focus on the range of the putatively random values was constructed. The 17 and 26 are a bit of a distraction. Repeating your experiment with uniform on 0 and 1 generates something equally as unlikely:

f <- function(x) {
  set.seed(x)
  runif(1)
}

 check_range <-function(seeds){
   vals <- sapply(seeds,f)
   max(vals)-min(vals)
}

When run against your seeds:

> check_range(seeds)
[1] 0.09026112

A reasonable model for check_range(seeds) when run on 21 random seeds is that it is the sample range of a random sample of size 21 drawn U(0,1). Its theoretical density is given by:

f <- function(x){420*x^19*(1-x)}

We can use this to calculate the probability of observing a range of 0.09 or smaller:

> integrate(f,0,0.09)
2.334272e-20 with absolute error < 2.6e-34

As a check that it is reasonable to model the sample range when seeding the Mersenne Twister like that, you can do the following experiment:

ranges <- replicate(1000,check_range(sample(8548915:86988080,21)))
x <- seq(0,1,0.01)
y <- f(x)
hist(ranges,freq = FALSE,xlim =c(0,1))
points(x,y,type = "l")
abline(v=0.09)

Output:

enter image description here

The density histogram follows the theoretical density reasonably well. The set of 21 seeds from your question represents an extreme outlier. It is very unlikely to be due to chance and is also unlikely to be due to some underlying flaw in the Mersenne Twister. The most likely explanation is that the Mersenne Twister itself was involved in producing those 21 values (but not of course in the naïve way of simply using sample() to draw 21 values).

like image 36
John Coleman Avatar answered Nov 07 '22 00:11

John Coleman


When you're using Mersenne Twister with a single seed, it's reasonable to assume that the generated values are approximately independent and identically distributed. Unfortunately, there are no guarantees about the generated values from two streams started at different seeds. See, for example, this SC thread.

In your situation, I would recommend either using one of the seed selection strategies suggested in the SC thread, or else switching to a PRNG with better guarantees about parallel streams. One option is L'Ecuyer's "RngStreams" generator:

set.seed(0, kind = "L'Ecuyer-CMRG")

Even with that PRNG, I don't know if it still holds that you can seed the PRNG with arbitrary seeds and get roughly independent streams.

As far as the difference between Ubuntu and Windows, it's possible that one of these systems is using a 32-bit generator, and the other is using 64-bit.

like image 1
Patrick Perry Avatar answered Nov 06 '22 23:11

Patrick Perry