Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient random number generation from a truncated normal distribution

I would like to sample 50,000 values from the normal distribution with mean = 0 and sd -1. But I want to limit the values to [-3,3]. I've written code to do this, but not sure if it is most efficient? Was hoping to get some suggestions.

lower <- -3 
upper <- 3
x_norm<-rnorm(75000,0,1)
x_norm<-x_norm[which(x_norm >=lower & x_norm<=upper)]
repeat{
    x_norm<-c(x_norm, rnorm(10000,0,1))
    x_norm<-x_norm[which(x_norm >=lower & x_norm<=upper)]
    if(length(x_norm) >= 50000){break}
}
x_norm<-x_norm[1:50000]
like image 893
user1375871 Avatar asked Dec 25 '12 21:12

user1375871


2 Answers

John and Dirk gave nice examples of rejection sampling, which should be fine for the given question. But to give another approach, when you have the cumulative distribution function and its inverse (or reasonable approximations thereof) you can just generate data from a uniform distribution and transform:

x <- qnorm( runif(50000, pnorm(-3), pnorm(3)) )
range(x)
hist(x)

For the given question I don't expect this to be much better (if any better) than the rejection sampling methods, but if you wanted to generate data between 2 and 3 from a truncated normal 0,1 then this method would probably be much more efficient. It does depend on the cumulative and its inverse (pnorm and qnorm in this case) and so would not be as simple as the rejection sampling for a distribution without those easily available.

like image 51
Greg Snow Avatar answered Nov 26 '22 06:11

Greg Snow


Something like your code will most definitely work but you're greatly overestimating how many values you need. Given that it's a known distribution and a fairly large number of samples you know how many will appear more or less than 3.

(1-pnorm(3))*2 * 50000
[1] 134.9898

So, given you likely only get about 135 out of range in a draw of 50,000 it's pretty easy to draw a few more than that, but still not an inordinately larger number and trim it. Just take the first 50,000 of 50,500 that are less than or greater than 3.

x <- rnorm(50500)
x <- x[x < 3 & x > -3]
x <- x[1:50000]

I ran the first 2 lines 40,000 times and it returned a length greater than 50000 every time. A small boolean check could guarantee it always does.

x <- 1
while (length(x) < 50000){
    x <- rnorm(50500)
    x <- x[x < 3 & x > -3]}
x <- x[1:50000]

For me this executes almost 100% of the time in 6 ms. It's a simple way to do it in R that executes very fast, is easy to read, and doesn't require add ons.

like image 22
John Avatar answered Nov 26 '22 06:11

John