Hello and thanks in advance for the help! I am trying to generate a vector with a specific number of values that are assigned according to a probability distribution. For example, I want a vector of length 31, contained 26 zeroes and 5 ones. (The total sum of the vector should always be five.) However, the location of the ones is important. And to identify which values should be one and which should be zero, I have a vector of probabilities (length 31), which looks like this: <pre class="prettyprint"><code>probs<-c(0.01,0.02,0.01,0.02,0.01,0.01,0.01,0.04,0.01,0.01,0.12,0.01,0.02,0.01, 0.14,0.06,0.01,0.01,0.01,0.01,0.01,0.14,0.01,0.07,0.01,0.01,0.04,0.08,0.01,0.02,0.01) </code></pre> I can select values according to this distribution and get a vector of length 31 using rbinom, but I can't select exactly five values. <pre class="prettyprint"><code>Inv=rbinom(length(probs),1,probs) Inv [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 </code></pre> Any ideas? Thanks again!

How about just using a weighted <code>sample.int</code> to select the locations? <pre class="prettyprint"><code>Inv<-integer(31) Inv[sample.int(31,5,prob=probs)]<-1 Inv [1] 0 0 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 </code></pre>

Chase provides a great answer and mentions the problem of the run-away <code>while()</code> iteration. One of the problems with a run-away <code>while()</code> is that if you do this one trial at a time, and it takes many, say t, trials to find one that matches the target number of <code>1</code>s, you incur the overhead of t calls to the main function, <code>rbinom()</code> in this case. There is a way out, however, because <code>rbinom()</code>, like all of these (pseudo)random number generators in R, is vectorised, we can generate m trials at a time and check those m trials for conformance to the requirements of 5 <code>1</code>s. If none are found, we repeatedly draw m trials until we find one that does conform. This idea is implemented in the function <code>foo()</code> below. The <code>chunkSize</code> argument is m, the number of trials to draw at a time. I also took the opportunity to allow the function to find more than a single conformal trial; argument <code>n</code> controls how many conformal trials to return. <pre class="prettyprint"><code>foo <- function(probs, target, n = 1, chunkSize = 100) { len <- length(probs) out <- matrix(ncol = len, nrow = 0) ## return object ## draw chunkSize trials trial <- matrix(rbinom(len * chunkSize, 1, probs), ncol = len, byrow = TRUE) rs <- rowSums(trial) ## How manys `1`s ok <- which(rs == 5L) ## which meet the `target` found <- length(ok) ## how many meet the target if(found > 0) ## if we found some, add them to out out <- rbind(out, trial[ok, , drop = FALSE][seq_len(min(n,found)), , drop = FALSE]) ## if we haven't found enough, repeat the whole thing until we do while(found < n) { trial <- matrix(rbinom(len * chunkSize, 1, probs), ncol = len, byrow = TRUE) rs <- rowSums(trial) ok <- which(rs == 5L) New <- length(ok) if(New > 0) { found <- found + New out <- rbind(out, trial[ok, , drop = FALSE][seq_len(min(n, New)), , drop = FALSE]) } } if(n == 1L) ## comment this, and out <- drop(out) ## this if you don't want dimension dropping out } </code></pre> It works like this: <pre class="prettyprint"><code>> set.seed(1) > foo(probs, target = 5) [1] 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 [31] 0 > foo(probs, target = 5, n = 2) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [1,] 0 0 0 0 0 0 0 0 0 0 0 [2,] 0 0 0 0 0 0 0 0 0 0 1 [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [1,] 0 0 0 1 1 0 0 0 0 0 [2,] 0 1 0 0 1 0 0 0 0 0 [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31] [1,] 1 0 1 0 0 0 1 0 0 0 [2,] 1 0 1 0 0 0 0 0 0 0 </code></pre> Note that I drop the empty dimension in the case where <code>n == 1</code>. Comment the last <code>if</code> code chunk out if you don't want this feature. You need to balance the size of <code>chunkSize</code> with the computational burden of checking that many trials at a time. If the requirement (here 5 <code>1</code>s) is very unlikely, then increase <code>chunkSize</code> so you incur fewer calls to <code>rbinom()</code>. If the requirement is likely, there is little point drawing trials and large <code>chunkSize</code> at a time if you only want one or two as you have to evaluate each trial draw.

Assigning a specific number of values informed by a probability distribution (in R)

Tags:

r

vector

probability

Hello and thanks in advance for the help!

I am trying to generate a vector with a specific number of values that are assigned according to a probability distribution. For example, I want a vector of length 31, contained 26 zeroes and 5 ones. (The total sum of the vector should always be five.) However, the location of the ones is important. And to identify which values should be one and which should be zero, I have a vector of probabilities (length 31), which looks like this:

probs<-c(0.01,0.02,0.01,0.02,0.01,0.01,0.01,0.04,0.01,0.01,0.12,0.01,0.02,0.01,
0.14,0.06,0.01,0.01,0.01,0.01,0.01,0.14,0.01,0.07,0.01,0.01,0.04,0.08,0.01,0.02,0.01)

I can select values according to this distribution and get a vector of length 31 using rbinom, but I can't select exactly five values.

Inv=rbinom(length(probs),1,probs)
Inv
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0

Any ideas?

Thanks again!

699

asked Aug 04 '11 03:08

Laura

2 Answers

How about just using a weighted sample.int to select the locations?

Inv<-integer(31)
Inv[sample.int(31,5,prob=probs)]<-1
Inv
[1] 0 0 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0

answered Sep 28 '22 17:09

James

Chase provides a great answer and mentions the problem of the run-away while() iteration. One of the problems with a run-away while() is that if you do this one trial at a time, and it takes many, say t, trials to find one that matches the target number of 1s, you incur the overhead of t calls to the main function, rbinom() in this case.

There is a way out, however, because rbinom(), like all of these (pseudo)random number generators in R, is vectorised, we can generate m trials at a time and check those m trials for conformance to the requirements of 5 1s. If none are found, we repeatedly draw m trials until we find one that does conform. This idea is implemented in the function foo() below. The chunkSize argument is m, the number of trials to draw at a time. I also took the opportunity to allow the function to find more than a single conformal trial; argument n controls how many conformal trials to return.

foo <- function(probs, target, n = 1, chunkSize = 100) {
    len <- length(probs)
    out <- matrix(ncol = len, nrow = 0) ## return object
    ## draw chunkSize trials
    trial <- matrix(rbinom(len * chunkSize, 1, probs),
                    ncol = len, byrow = TRUE)
    rs <- rowSums(trial)  ## How manys `1`s
    ok <- which(rs == 5L) ## which meet the `target`
    found <- length(ok)   ## how many meet the target
    if(found > 0)         ## if we found some, add them to out
        out <- rbind(out,
                     trial[ok, , drop = FALSE][seq_len(min(n,found)), , 
                                               drop = FALSE])
    ## if we haven't found enough, repeat the whole thing until we do
    while(found < n) {
        trial <- matrix(rbinom(len * chunkSize, 1, probs),
                            ncol = len, byrow = TRUE)
        rs <- rowSums(trial)
        ok <- which(rs == 5L)
        New <- length(ok)
        if(New > 0) {
            found <- found + New
            out <- rbind(out, trial[ok, , drop = FALSE][seq_len(min(n, New)), , 
                                                        drop = FALSE])
        }
    }
    if(n == 1L)           ## comment this, and
        out <- drop(out)  ## this if you don't want dimension dropping
    out
}

It works like this:

> set.seed(1)
> foo(probs, target = 5)
 [1] 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0
[31] 0
> foo(probs, target = 5, n = 2)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,]    0    0    0    0    0    0    0    0    0     0     0
[2,]    0    0    0    0    0    0    0    0    0     0     1
     [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
[1,]     0     0     0     1     1     0     0     0     0     0
[2,]     0     1     0     0     1     0     0     0     0     0
     [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31]
[1,]     1     0     1     0     0     0     1     0     0     0
[2,]     1     0     1     0     0     0     0     0     0     0

Note that I drop the empty dimension in the case where n == 1. Comment the last if code chunk out if you don't want this feature.

You need to balance the size of chunkSize with the computational burden of checking that many trials at a time. If the requirement (here 5 1s) is very unlikely, then increase chunkSize so you incur fewer calls to rbinom(). If the requirement is likely, there is little point drawing trials and large chunkSize at a time if you only want one or two as you have to evaluate each trial draw.

answered Sep 28 '22 18:09

Gavin Simpson

Related questions
                            
                                How do I use a macro variable in R? (Similar to %LET in SAS)
                            
                                Setting *only* column names in Rcpp
                            
                                Add axis tick-marks on top and to the right to a ggplot?
                            
                                Extract text in parentheses in R
                            
                                Efficient Way to Incrementally Count Unique Data Points in Data Frame
                            
                                Converting Column Values into Their Own Binary Encoded Columns (Dummy Variables)
                            
                                RStudio HiDPI support
                            
                                r - using summarise_each() to count records ignoring NAs
                            
                                Adding value from one data.frame to another data.frame by matching a variable
                            
                                Loading SpatialPolygonsDataFrame with Leaflet (for R) doesn't work
                            
                                Simulating a timeseries in dplyr instead of using a for loop
                            
                                Remove first character from a string in data frame column [duplicate]
                            
                                Problems with installation R packages
                            
                                Subset by row and column reciprocity [duplicate]
                            
                                How to divide a number of columns by one column?
                            
                                How to define multiple variables with lapply?
                            
                                when to use map() function and when to use summarise_at()/mutate_at()
                            
                                R CRAN Check fail when using parallel functions
                            
                                Getting the unique count of strings from a text string
                            
                                Extracting column names with condition from a data frame

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With