Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why are two random deviates needed to ensure uniform sampling of large integers with sample()?

Given that the following are equivalent we might infer that R uses the same C runif function to generate uniform samples for sample() and runif()...

set.seed(1)
sample(1000,10,replace=TRUE)
#[1] 27 38 58 91 21 90 95 67 63  7

set.seed(1)
ceiling( runif(10) * 1000 )
#[1] 27 38 58 91 21 90 95 67 63  7

However they are not equivalent when dealing with big numbers ( n > 2^32 - 1 ):

set.seed(1)
ceiling( runif(1e1) * as.numeric(10^12) )
#[1] 265508663143 372123899637 572853363352 908207789995 201681931038 898389684968
#[7] 944675268606 660797792487 629114043899  61786270468

set.seed(1)
sample( as.numeric(10^12) , 1e1 , replace = TRUE )
#[1] 2655086629 5728533837 2016819388 9446752865 6291140337 2059745544 6870228465
#[8] 7698414177 7176185248 3800351852

Update

As @Arun points out the 1st, 3rd, 5th, ... of runif() approximate result of 1st, 2nd, 3rd... from sample().

It turns out that both functions call unif_rand() behind the scenes, however, sample, given an argument, n that is larger than the largest representable integer of type "integer" but representable as an integer as type "numeric" uses this static definition to draw a random deviate (as opposed to just unif_rand() as in the case of runif())...

static R_INLINE double ru()
   {
       double U = 33554432.0;
       return (floor(U*unif_rand()) + unif_rand())/U;
   }

With a cryptic note in the docs that...

Two random numbers are used to ensure uniform sampling of large integers.

  • Why are two random numbers needed to ensure uniform sampling of large integers?

  • What is the constant U for and why does it take the particular value 33554432.0?

like image 679
Simon O'Hanlon Avatar asked Nov 07 '13 14:11

Simon O'Hanlon


1 Answers

The reason is that a 25-bit PRNG won't produce enough bits to generate all possible integer values in a range wider than 2^25. In order to give a non-zero probability to every possible integer value, it is necessary to call the 25-bit PRNG twice. With two calls (like in the code you cite), you get 50 random bits.

Note that a double has 53 bits of mantissa, so calling the PRNG twice is still short of 3 bits.

like image 55
Rémi Avatar answered Jan 26 '23 03:01

Rémi