Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Conditional Random Sample in R

Tags:

random

r

sampling

I am wondering what the best way to solve this is. Essentially I want to generate 20 samples which add to 100 but also where (x1+x2>20). I am struggling to get something that is fast and efficient. I realise that I could filter out the lines that don't meet this criteria but it isn't efficient if I generate 10,000 rather than 20.

The code is as below:

n = 20
x1 = sample(0:100,n,replace = TRUE)
x2 = sample(0:100,n,replace = TRUE)
x3 = sample(0:100,n,replace = TRUE)
index = (x1+x2+x3)>100
G=(x1+x2)>20
while(sum(index)>0&&sum(G)>0){
   x1[index&&G] = sample(0:100,n,replace = TRUE)
   x2[index&&G] = sample(0:100,n,replace = TRUE)
   x3[index&&G] = sample(0:100,n,replace = TRUE)
index =(x1+x2+x3)>100
G=(x1+x2)>20
}
x4=rep(100,n)-x1-x2-x3

df <- data.frame(x1,x2,x3,x4)

Thanks in advance.

like image 772
swalk88 Avatar asked Jun 23 '26 11:06

swalk88


2 Answers

Another possibility: Pick three breaks of the sequence 0:100. x1, x2, x3 and x4 are then generated in-between those breaks. If x1 + x2 is smaller than 20, then x3 + x4 is greater than 20, so we can swap them.

generate_four_numbers <- function(from = 0, to = 100) {
    breaks <- sort(sample(seq(from, to), 3 ,replace = TRUE))
    x1 <- breaks[1]
    x2 <- breaks[2] - breaks[1]
    x3 <- breaks[3] - breaks[2]
    x4 <- 100 - breaks[3]

    if (x1 + x2 <= 20) {
        return(data.frame(x1 = x4, x2 = x3, x3 = x2, x4 = x1)
    }

    data.frame(x1, x2, x3, x4)
}

res <- do.call(rbind, lapply(1:10000, function(x) generate_four_numbers()))

table(rowSums(res)) # all at 100

length(which(res$x1 + res$x2 > 20)) / nrow(res) # 100 % acceptable
like image 83
Guillaume Devailly Avatar answered Jun 25 '26 05:06

Guillaume Devailly


Here is an unbiased way to pick k numbers in the range 0:n which sum to n. It is based on the stars and bars encoding:

#picks k random numbers in range 0:n which sum to n:

pick <- function(k,n){
  m <- n + k - 1 #number of stars and bars
  bars <- sort(sample(1:m,k-1)) #positions of the bars
  c(bars,m+1)-c(0,bars)-1
}

This generates a single example, returning a vector. As @Guillaume Devailly observes in their answer, most of the samples will satisfy the additional constraint on the sum of the first 2 numbers, so you can just filter out those that don't.

Note that if you want 4 numbers in the range 1:100 which sum to 100 you could just use 1 + pick(4,96).

To enforce the constraint on the first two numbers:

pick.sample <- function(){
  while(TRUE){
    x <- pick(4,100)
    if(sum(x[1:2]) >20) return(x)
  }
}

Then

df <- data.frame(t(replicate(10000,pick.sample())))

will create a 10,000 row dataframe where each row is a sample which satisfies the constraints.

like image 45
John Coleman Avatar answered Jun 25 '26 04:06

John Coleman