I've been writing some code that iteratively performs binomial draws (using rbinom), and for some input arguments I can end up with the size being large, which causes R (3.1.1, both official and homebrew builds tested, so this is unlikely to be compiler related) to return an unexpected NA. For example:
rbinom(1,2^32,0.95)
is what I'd expect to work, but it gives NA back. However, running with size = 2^31 or prob <= 0.5 works.
The fine manual mentions that inversion is used when size < .Machine$integer.max is false; could this be the issue?
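For what it's worth, both of the sizes above are already past that threshold, which you can check directly:

.Machine$integer.max           # 2147483647, i.e. 2^31 - 1
2^31 > .Machine$integer.max    # TRUE, so this size takes the inversion branch
2^32 > .Machine$integer.max    # TRUE, and so does this one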
Looking at the source, rbinom does the equivalent (in C code) of the following for such large sizes:
qbinom(runif(n), size, prob, FALSE)
And indeed:
set.seed(42)
rbinom(1,2^31,0.95)
#[1] 2040095619
set.seed(42)
qbinom(runif(1), 2^31, 0.95, F)
#[1] 2040095619
However:
set.seed(42)
rbinom(1,2^32,0.95)
#[1] NA
set.seed(42)
qbinom(runif(1), 2^32, 0.95, F)
#[1] 4080199349
As @BenBolker points out, rbinom returns an integer, and if the return value is larger than .Machine$integer.max (i.e. larger than 2147483647 on my machine), NA gets returned. In contrast, qbinom returns a double. I don't know why, and it doesn't seem to be documented.
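You can see both halves of this directly (a quick check of my own, not from the original post):

typeof(rbinom(1, 10, 0.5))    # "integer"
typeof(qbinom(0.5, 10, 0.5))  # "double"
as.integer(2^31)              # NA, with a warning about coercion to integer range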
So there is at least some undocumented behavior here, and you should probably report it.
I agree that (in the absence of documentation saying this is a problem) this is a bug. A reasonable workaround would be to use the Normal approximation, which should be very good indeed (and faster) for such large values. (I originally meant this to be short and simple, but it ended up getting a little out of hand.)
rbinom_safe <- function(n, size, prob, max.size = 2^31) {
    ## recycle size and prob to a common length
    maxlen <- max(length(size), length(prob), n)
    prob <- rep(prob, length.out = maxlen)
    size <- rep(size, length.out = maxlen)
    res <- numeric(n)            # double storage, so no integer overflow
    bigvals <- size > max.size
    if (nbig <- sum(bigvals)) {
        ## Normal approximation for sizes too big for rbinom
        m <- (size*prob)[bigvals]
        sd <- sqrt(size*prob*(1-prob))[bigvals]
        res[bigvals] <- round(rnorm(nbig, mean = m, sd = sd))
    }
    if (nbig < n) {
        ## exact draws for the remaining, safe sizes
        res[!bigvals] <- rbinom(n - nbig, size[!bigvals], prob[!bigvals])
    }
    return(res)
}
set.seed(101)
size <- c(1,5,10,2^31,2^32)
rbinom_safe(5,size,prob=0.95)
rbinom_safe(5,3,prob=0.95)
rbinom_safe(5,2^32,prob=0.95)
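As a quick sanity check that the workaround actually escapes the integer limit (my addition, not part of the original answer), the size = 2^32 draws come back as doubles around 4.08e9:

res <- rbinom_safe(5, 2^32, prob = 0.95)
typeof(res)                      # "double"
all(res > .Machine$integer.max)  # essentially always TRUE: mean ~4.08e9, sd only ~1.4e4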
The Normal approximation should work reasonably well whenever the mean is many standard deviations away from the boundaries 0 and N (equivalently, whenever p is not too close to 0 or 1 for the given N). For large N this should be OK unless p is very extreme. For example:
n <- 2^31
p <- 0.95
m <- n*p
sd <- sqrt(n*p*(1-p))
set.seed(101)
rr <- rbinom_safe(10000,n,prob=p)
hist(rr,freq=FALSE,col="gray",breaks=50)
curve(dnorm(x,mean=m,sd=sd),col=2,add=TRUE)
dd <- round(seq(m-5*sd,m+5*sd,length.out=101))
midpts <- (dd[-1]+dd[-length(dd)])/2
lines(midpts, c(diff(sapply(dd, pbinom, size=n, prob=p))/diff(dd)[1]),
      col="blue", lty=2)
This is the intended behaviour, but there are two issues: 1) the NA induced by coercion should raise a warning, and 2) the fact that discrete random variables have storage mode integer should be documented.
I have fixed 1) and will modify the documentation to fix 2) when I have a little more time.