I'm looking for a more efficient way to sample from a list of integers 1:n, multiple times, where the probability vector (also length n) is different each time. For 20 trials with n = 10, I know one can do it like this:
probs <- matrix(runif(200), nrow = 20)
answers <- numeric(20)
for(i in 1:20) answers[i] <- sample(10,1,prob=probs[i,])
But that calls sample 10 times just to get a single number each time, so it's is presumably not the fastest way. Speed would be helpful as the code will be doing this plenty of times.
Many thanks!
Luke
Edit: Big thanks to Roman, whose idea about benchmarking helped me find a good solution. I've now moved this to the answer.
Just for fun, I tried two more versions. On what scale are you doing this sampling? I think all of these are pretty fast and more or less equivalent (I haven't included the creation of probs for your solution). Would love to see others take a shot at this.
library(rbenchmark)
benchmark(replications = 1000,
          luke = for(i in 1:20) answers[i] <- sample(10,1,prob=probs[i,]),
          roman = apply(probs, MARGIN = 1, FUN = function(x) sample(10, 1, prob = x)),
          roman2 = replicate(20, sample(10, 1, prob = runif(10))))
    test replications elapsed relative user.self sys.self user.child sys.child
1   luke         1000    0.41    1.000      0.42        0         NA        NA
2  roman         1000    0.47    1.146      0.46        0         NA        NA
3 roman2         1000    0.47    1.146      0.44        0         NA        NA
Here's another approach that I found. It's fast, but not as fast as simply calling sample many times with a for loop. I initially thought it was very good, but I was using benchmark() incorrectly.
luke2 = function(probs) { # takes a matrix of probability vectors, each in its own row
                probs <- probs/rowSums(probs) 
                probs <- t(apply(probs,1,cumsum)) 
                answer <- rowSums(probs - runif(nrow(probs)) < 0) + 1 
                return(answer)  }
Here's how it works: picture the probabilities as lines of various lengths laid out on a number line from 0 to 1. The big probabilities will take up more of the number line than the small ones. You could then pick the outcome by picking a random point on the number line - the big probabilities will have more likelihood of being chosen. The advantage of this approach is that you can roll all the random numbers needed in one call of runif(), instead of calling sample over and over as in the functions luke, roman and roman2. However, it looks like the extra data processing slows it down and the costs more than offset this benefit.
library(rbenchmark)
probs <- matrix(runif(2000), ncol = 10)
answers <- numeric(200)
benchmark(replications = 1000,
          luke = for(i in 1:20) answers[i] <- sample(10,1,prob=probs[i,]),
          luke2 = luke2(probs),
          roman = apply(probs, MARGIN = 1, FUN = function(x) sample(10, 1, prob = x)),
          roman2 = replicate(20, sample(10, 1, prob = runif(10))))
              roman = apply(probs, MARGIN = 1, FUN = function(x) sample(10, 1, prob = x)),
              roman2 = replicate(20, sample(10, 1, prob = runif(10))))
    test replications elapsed relative user.self sys.self user.child sys.child
    1   luke         1000   0.171    1.000     0.166    0.005          0         0
    2  luke2         1000   0.529    3.094     0.518    0.012          0         0
    3  roman         1000   1.564    9.146     1.513    0.052          0         0
    4 roman2         1000   0.225    1.316     0.213    0.012          0         0
For some reason, apply() does very badly as you add more rows. I don't understand why, because I thought it was a wrapper for for() and should therefore roman() should perform similarly to luke().
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With