Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R sample probabilities: Default is equal weight; why does specifying equal weights cause different values to be returned?

I have a simple question regarding the sample function in R. I'm randomly sampling from 0s and 1s and summing them together, from an input vector of length 5, which designates the number of trials to run and sets the seed to generate reproducible random numbers. Seed works as expected, but I get different matrices of random numbers depending on what I put in the prob statement. In this case I assumed prob=NULL should be the same as prob=c(0.5,0.5). Why isn't it?

vn<-c(12, 44, 9, 17, 28)

> do.call(cbind, lapply(c(1:10),function(X) {set.seed(X); sapply(vn, function(Y) sum(sample(x=c(0,1),size=Y,replace=T)), simplify=TRUE)}))

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    6    7    7    6    6    9    3    6    2     5
[2,]   22   21   20   29   22   24   24   19   25    19
[3,]    4    8    3    5    4    4    4    6    4     2
[4,]    8    4   12    9   11    7    9   10    8     8
[5,]   13    9   11   14   12   14   10   13   11    12

> do.call(cbind, lapply(c(1:10),function(X) {set.seed(X); sapply(vn, function(Y) sum(sample(x=c(0,1),size=Y,replace=T, prob=c(0.5,0.5))), simplify=TRUE)}))

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    6    5    5    6    6    3    9    6   10     7
[2,]   22   23   24   15   22   20   20   25   19    25
[3,]    5    1    6    4    5    5    5    3    5     7
[4,]    9   13    5    8    6   10    8    7    9     9
[5,]   15   19   17   14   16   14   18   15   17    16

UPDATE:

I extended the samplings to 100, with an input vector

vn<-seq(0,100,5)

and compared the rowMeans of the output matrices without prob (test1) and with prob=c(0.5,0.5) against expected mean. Interestingly, test1 and test2 are off by the exact same amount by reversed signs. Why is that? Thanks!

> rowMeans(test1)-seq(0,100,5)/2
 [1]  0.00 -0.07 -0.01 -0.35 -0.07  0.19 -0.07  0.24  0.21  0.46  0.20  0.50 -0.37 -0.35  0.00  0.64 -0.59  0.63 -1.19  0.44 -0.38

> rowMeans(test2)-seq(0,100,5)/2
 [1]  0.00  0.07  0.01  0.35  0.07 -0.19  0.07 -0.24 -0.21 -0.46 -0.20 -0.50  0.37  0.35  0.00 -0.64  0.59 -0.63  1.19 -0.44  0.38
like image 771
4 revs, 3 users 95% Avatar asked Feb 13 '23 03:02

4 revs, 3 users 95%


2 Answers

As suggested by Randy, different routines are used by sample.int depending on whether prop is NULL.

In your case, it returns inverse results:

> set.seed(1); sample(c(0,1), size=20, replace=TRUE)
 [1] 0 0 1 1 0 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1
> set.seed(1); sample(c(0,1), size=20, replace=TRUE, prob=c(.5,.5))
 [1] 1 1 0 0 1 0 0 0 0 1 1 1 0 1 0 1 0 0 1 0

What's going on?

For the former, we hit line src/main/random.c:546:

 for (int i = 0; i < k; i++) iy[i] = (int)(dn * unif_rand() + 1);

This one is simple. unif_rand() returns a value between 0 and 1 (and will never return 1), dn is 2 (the number of elements in x) so iy[i] is set to 1 or 2, depending on whether unif_rand() returns a value < .5 or >= .5 respectively; this is the value chosen from x.

The latter is a bit more complex. Because prob is specified, do_sample calls the function ProbSampleReplace at src/main/random.c:309. Here, the probabilities are sorted into descending order with the function revsort at src/main/sort.c:248. This uses a heap sort on the probabilities, and with a two-element vector of equal probabilities, it reverses the order.

ProbSampleReplace again calls unif_rand() but this time it maps it to the cumulative probabilities computed after flipping the order of the vector, so if unif_rand() returns a value < 0.5 the second value is returned (1 in your example). This is the code that does the mapping of unif_rand() to the values in x:

/* compute the sample */
for (i = 0; i < nans; i++) {
    rU = unif_rand();
    for (j = 0; j < nm1; j++) {
        if (rU <= p[j])
            break;
    }
    ans[i] = perm[j];
}

So with equal probabilities of two elements, setting the probability explicitly to c(0.5, 0.5) will return the inverse of the same call without setting the probabilities. With more than two elements, it's not going to always reverse them, but it won't return the same order.

This also explains why Fernando's suggestion works. The values are close enough to .5 as to not change the results for this example, and the heap sort returns the values in the original order.

This expression returns the same matrix as your first line of code:

do.call(cbind, lapply(c(1:10),function(X) {set.seed(X); sapply(vn, function(Y) sum(sample(x=c(1,0),size=Y,replace=T, prob=c(0.5,0.5))), simplify=TRUE)}))

Here, the order of the entries in x have been reversed to account for the two-element sort of equal values (which swaps the entries).

Of course this is all academic. In practice, permuting the order of equiprobable entries doesn't matter.

Source files and line numbers above refer to R 3.0.2.

like image 147
Matthew Lundberg Avatar answered Feb 16 '23 03:02

Matthew Lundberg


I updated my comment to an answer. sample uses different c routines for uniform sampling and weighted sampling. Though you are using equal weights, R will call the weighted sampling anyway. To see this, consider

> set.seed(1)
> sample.int(100)
  [1]  27  37  57  89  20  86  97  62  58   6  19  16  61  34  67  43  88  83
 [19]  32  63  75  17  51  10  21  29   1  28  81  25  87  42  70  13  55  44
 [37]  78   7  45  26  50  39  46  82  30  65   2  84  59  36  24  85  22  12
 [55]   4   5  14  23  73  79  99  47  18  95  60  77  41  53   3  69  11  71
 [73]  35  31  40  49  76   9  38  64  80  66   8  91  33  92 100  54  98  94
 [91]  52  74  68  72  93  15  56  48  90  96
> set.seed(1)
> sample.int(100, prob = rep(1/100, 100))
  [1]  28  39  60  93  21  91  96  67  63   7  22  18  71  41  79  51  74   1
 [19]  38  78  94  20  64  12  29  40   2  42  87  35  50  61  52  17  84  69
 [37]  81  10  73  44  85  65  80  54  49  82   4  46  75  68  43  90  36  23
 [55]   8  11  30  55  66  34  97  26  47  31  70  24  53  86   6  95  32  89
 [73]  27  33  56  98  88  25  77 100  37  62  19  15  76  13  59   5  14   9
 [91]  45   3  83  99  72  58  48  57  92  16

Note that the two different sampled sequences.

like image 22
Randy Lai Avatar answered Feb 16 '23 02:02

Randy Lai