
What happens when prob argument in sample sums to less/greater than 1?

Tags:

r

sample

We know that the prob argument of sample is used to assign probability weights to the elements being sampled.

For example,

table(sample(1:4, 1e6, replace = TRUE, prob = c(0.2, 0.4, 0.3, 0.1)))/1e6

#  1   2   3   4 
#0.2 0.4 0.3 0.1 


table(sample(1:4, 1e6, replace = TRUE, prob = c(0.2, 0.4, 0.3, 0.1)))/1e6

#    1     2     3     4 
#0.200 0.400 0.299 0.100 

In this example, the probabilities sum to exactly 1 (0.2 + 0.4 + 0.3 + 0.1), so we get the expected proportions. But what if the probabilities do not sum to 1? What output would it give? I thought it would result in an error, but it returns values anyway.

When the probabilities sum to more than 1:

table(sample(1:4, 1e6, replace = TRUE, prob = c(0.2, 0.5, 0.5, 0.1)))/1e6

#     1      2      3      4 
#0.1544 0.3839 0.3848 0.0768 

table(sample(1:4, 1e6, replace = TRUE, prob = c(0.2, 0.5, 0.5, 0.1)))/1e6

#     1      2      3      4 
#0.1544 0.3842 0.3848 0.0767 

When the probabilities sum to less than 1:

table(sample(1:4, 1e6, replace = TRUE, prob = c(0.1, 0.1, 0.5, 0.1)))/1e6

#    1     2     3     4 
#0.124 0.125 0.625 0.125 

table(sample(1:4, 1e6, replace = TRUE, prob = c(0.1, 0.1, 0.5, 0.1)))/1e6

#    1     2     3     4 
#0.125 0.125 0.625 0.125 

As we can see, repeated runs give proportions that do not equal prob, yet the results are stable from run to run rather than arbitrary. How are the numbers distributed in this case? Where is it documented?

I tried searching on the internet but didn't find any relevant information. I looked through the documentation at ?sample which has

The optional prob argument can be used to give a vector of weights for obtaining the elements of the vector being sampled. They need not sum to one, but they should be non-negative and not all zero. If replace is true, Walker's alias method (Ripley, 1987) is used when there are more than 200 reasonably probable values: this gives results incompatible with those from R < 2.2.0.

So it says that the prob argument need not sum to 1, but it doesn't say what to expect when it doesn't. I am not sure if I am missing some part of the documentation. Does anybody have any idea?

Ronak Shah asked Jan 26 '20 13:01


2 Answers

Good question. The docs are unclear on this, but the question can be answered by reviewing the source code.

If you look at the R code, sample always calls another R function, sample.int. If you pass a single number x to sample, it uses sample.int to draw from the integers less than or equal to that number, whereas if x is a vector, it uses sample.int to generate a sample of integers less than or equal to length(x), then uses that to subset x.
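To check that delegation from the R console, here is a minimal sketch. It assumes the vector branch of sample is just x[sample.int(length(x), ...)] as described above; the names x, w, via_sample and via_index are made up for illustration.

```r
x <- c("a", "b", "c", "d")
w <- c(0.2, 0.5, 0.5, 0.1)

# Draw through sample() directly
set.seed(42)
via_sample <- sample(x, 10, replace = TRUE, prob = w)

# Draw indices with sample.int() and subset x by hand, using the same seed
set.seed(42)
via_index <- x[sample.int(length(x), 10, replace = TRUE, prob = w)]

# Expected to be TRUE if sample() delegates as described
identical(via_sample, via_index)
```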

Now, if you examine the function sample.int, it looks like this:

function (n, size = n, replace = FALSE, prob = NULL, useHash = (!replace && 
    is.null(prob) && size <= n/2 && n > 1e+07)) 
{
    if (useHash) 
        .Internal(sample2(n, size))
    else .Internal(sample(n, size, replace, prob))
}

The .Internal means the sampling is done by calling compiled code written in C: in this case, it's the function do_sample, defined in src/main/random.c.

If you look at this C code, do_sample checks whether it has been passed a prob vector. If not, it samples on the assumption of equal weights. If prob exists, the function ensures that it is numeric and not NA. If prob passes these checks, a pointer to the underlying array of doubles is generated and passed to another function in random.c called FixUpProbs.

This function examines each member of prob and throws an error if any element is negative or not a finite double (zeros are allowed, as long as enough positive values remain). It then normalises the numbers by dividing each by the sum of all. There is therefore no preference at all for prob summing to 1 inherent in the code. That is, even if prob sums to 1 in your input, the function will still calculate the sum and divide each number by it.
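As a rough illustration, the check-then-normalise logic can be paraphrased in R. This is a sketch based on my reading of the behaviour described here, not a transcription of the C source; normalise_weights and its arguments k and replace are hypothetical names.

```r
# Hypothetical helper: an R paraphrase of the validation and normalisation
# steps, NOT the actual C implementation.
normalise_weights <- function(p, k = 1L, replace = TRUE) {
  if (any(!is.finite(p))) stop("non-finite weight")   # catches NA, NaN, Inf
  if (any(p < 0)) stop("negative weight")
  n_pos <- sum(p > 0)
  if (n_pos == 0 || (!replace && k > n_pos)) stop("too few positive weights")
  p / sum(p)                                          # always rescale to sum to 1
}

over  <- normalise_weights(c(0.2, 0.5, 0.5, 0.1))  # input sums to 1.3
under <- normalise_weights(c(0.1, 0.1, 0.5, 0.1))  # input sums to 0.8
over
under
```

The two results should line up with the simulated proportions in the question: roughly 0.154, 0.385, 0.385, 0.077 for the first and 0.125, 0.125, 0.625, 0.125 for the second.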

Therefore, the parameter is poorly named. It should be "weights", as others here have pointed out. To be fair, the docs only say that prob should be a vector of weights, not absolute probabilities.
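One way to see that only the relative weights matter is to rescale prob and compare draws under the same seed. A minimal sketch; I scale by 2 because multiplying by a power of two is exact in floating point, so the normalised weights should come out bit-for-bit the same. The names w, a and b are made up for illustration.

```r
w <- c(0.2, 0.5, 0.5, 0.1)

set.seed(1)
a <- sample(1:4, 20, replace = TRUE, prob = w)

set.seed(1)
b <- sample(1:4, 20, replace = TRUE, prob = w * 2)  # same ratios, sums to 2.6

# Expected to be TRUE: both weight vectors normalise to the same probabilities
identical(a, b)
```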

So the behaviour of the prob parameter from my reading of the code should be:

  1. prob can be absent altogether, in which case sampling defaults to equal weights.
  2. If any of prob's numbers are less than zero, or are infinite, or NA, the function will throw.
  3. An error should be thrown if any of the prob values are non-numeric, as they will be interpreted as NA in the SEXP passed to the C code.
  4. prob must have the same length as x, or the C code throws an error.
  5. You can pass a zero probability as one or more elements of prob if you have specified replace=T, as long as you have at least one non-zero probability.
  6. If you specify replace=F, the number of samples you request must be less than or equal to the number of non-zero elements in prob. Essentially, FixUpProbs will throw if you ask it to sample with a zero probability.
  7. A valid prob vector will be normalised to sum to 1 and used as sampling weights.
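The cases in this list can be probed from the console. A minimal sketch: throws is a hypothetical helper that only reports whether a call raised an error, and I deliberately don't rely on the exact wording of the error messages.

```r
# Hypothetical helper: TRUE if evaluating expr raises an error
throws <- function(expr) inherits(try(expr, silent = TRUE), "try-error")

res <- c(
  negative     = throws(sample(1:4, 2, prob = c(0.5, -0.1, 0.3, 0.3))),
  na_value     = throws(sample(1:4, 2, prob = c(0.5, NA, 0.3, 0.3))),
  infinite     = throws(sample(1:4, 2, prob = c(0.5, Inf, 0.3, 0.3))),
  wrong_length = throws(sample(1:4, 2, prob = c(0.5, 0.5))),
  all_zero     = throws(sample(1:4, 2, replace = TRUE, prob = c(0, 0, 0, 0))),
  too_few_pos  = throws(sample(1:4, 3, replace = FALSE, prob = c(1, 1, 0, 0)))
)
res  # every entry should be TRUE if the list above is right

# Zeros are fine with replacement as long as one weight is positive:
only_three <- sample(1:4, 10, replace = TRUE, prob = c(0, 0, 1, 0))
only_three  # can only ever contain 3
```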

As an interesting side effect of this behaviour, you can use odds instead of probabilities if you are choosing between 2 alternatives, by setting prob = c(1, odds).
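For example, a small sketch with odds of 3 to 1 in favour of the second alternative (the labels "no" and "yes" and the variable names are made up for illustration):

```r
odds <- 3

set.seed(123)
draws <- sample(c("no", "yes"), 1e5, replace = TRUE, prob = c(1, odds))

# The share of "yes" should be close to odds / (1 + odds), i.e. 0.75 here
mean(draws == "yes")
```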

Allan Cameron answered Oct 28 '22 22:10


As already mentioned, the weights are normalized to sum to 1, as can be demonstrated with the weights from your "more than 1" example:

> x <- c(0.2, 0.5, 0.5, 0.1)
> x/sum(x)
[1] 0.15384615 0.38461538 0.38461538 0.07692308

This matches your simulated tabulated data:

#     1      2      3      4 
#0.1544 0.3839 0.3848 0.0768 
Roman Luštrik answered Oct 28 '22 22:10