Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Assigning a specific number of values informed by a probability distribution (in R)

Hello and thanks in advance for the help!

I am trying to generate a vector with a specific number of values that are assigned according to a probability distribution. For example, I want a vector of length 31, contained 26 zeroes and 5 ones. (The total sum of the vector should always be five.) However, the location of the ones is important. And to identify which values should be one and which should be zero, I have a vector of probabilities (length 31), which looks like this:

probs<-c(0.01,0.02,0.01,0.02,0.01,0.01,0.01,0.04,0.01,0.01,0.12,0.01,0.02,0.01,
0.14,0.06,0.01,0.01,0.01,0.01,0.01,0.14,0.01,0.07,0.01,0.01,0.04,0.08,0.01,0.02,0.01)

I can select values according to this distribution and get a vector of length 31 using rbinom, but I can't select exactly five values.

Inv=rbinom(length(probs),1,probs)
Inv
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0

Any ideas?

Thanks again!

like image 699
Laura Avatar asked Aug 04 '11 03:08

Laura


People also ask

What is Rbinom in R?

Think of flipping a coin m times, where the coin is weighted to have probability p of landing on heads. The R function rbinom() generates random variables with a binomial distribution. E.g., rbinom(n=20, size=10, prob=0.5) produces 20 observations from Bin(10,0.5).


2 Answers

How about just using a weighted sample.int to select the locations?

Inv<-integer(31)
Inv[sample.int(31,5,prob=probs)]<-1
Inv
[1] 0 0 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
like image 52
James Avatar answered Sep 28 '22 17:09

James


Chase provides a great answer and mentions the problem of the run-away while() iteration. One of the problems with a run-away while() is that if you do this one trial at a time, and it takes many, say t, trials to find one that matches the target number of 1s, you incur the overhead of t calls to the main function, rbinom() in this case.

There is a way out, however, because rbinom(), like all of these (pseudo)random number generators in R, is vectorised, we can generate m trials at a time and check those m trials for conformance to the requirements of 5 1s. If none are found, we repeatedly draw m trials until we find one that does conform. This idea is implemented in the function foo() below. The chunkSize argument is m, the number of trials to draw at a time. I also took the opportunity to allow the function to find more than a single conformal trial; argument n controls how many conformal trials to return.

foo <- function(probs, target, n = 1, chunkSize = 100) {
    len <- length(probs)
    out <- matrix(ncol = len, nrow = 0) ## return object
    ## draw chunkSize trials
    trial <- matrix(rbinom(len * chunkSize, 1, probs),
                    ncol = len, byrow = TRUE)
    rs <- rowSums(trial)  ## How manys `1`s
    ok <- which(rs == 5L) ## which meet the `target`
    found <- length(ok)   ## how many meet the target
    if(found > 0)         ## if we found some, add them to out
        out <- rbind(out,
                     trial[ok, , drop = FALSE][seq_len(min(n,found)), , 
                                               drop = FALSE])
    ## if we haven't found enough, repeat the whole thing until we do
    while(found < n) {
        trial <- matrix(rbinom(len * chunkSize, 1, probs),
                            ncol = len, byrow = TRUE)
        rs <- rowSums(trial)
        ok <- which(rs == 5L)
        New <- length(ok)
        if(New > 0) {
            found <- found + New
            out <- rbind(out, trial[ok, , drop = FALSE][seq_len(min(n, New)), , 
                                                        drop = FALSE])
        }
    }
    if(n == 1L)           ## comment this, and
        out <- drop(out)  ## this if you don't want dimension dropping
    out
}

It works like this:

> set.seed(1)
> foo(probs, target = 5)
 [1] 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0
[31] 0
> foo(probs, target = 5, n = 2)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,]    0    0    0    0    0    0    0    0    0     0     0
[2,]    0    0    0    0    0    0    0    0    0     0     1
     [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
[1,]     0     0     0     1     1     0     0     0     0     0
[2,]     0     1     0     0     1     0     0     0     0     0
     [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31]
[1,]     1     0     1     0     0     0     1     0     0     0
[2,]     1     0     1     0     0     0     0     0     0     0

Note that I drop the empty dimension in the case where n == 1. Comment the last if code chunk out if you don't want this feature.

You need to balance the size of chunkSize with the computational burden of checking that many trials at a time. If the requirement (here 5 1s) is very unlikely, then increase chunkSize so you incur fewer calls to rbinom(). If the requirement is likely, there is little point drawing trials and large chunkSize at a time if you only want one or two as you have to evaluate each trial draw.

like image 31
Gavin Simpson Avatar answered Sep 28 '22 18:09

Gavin Simpson