Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

possible bug in `rbinom()` for large numbers of trials

Tags:

random

r

I've been writing some code that iteratively performs binomial draws (using rbinom) and for some callee arguments I can end up with the size being large, which causes R (3.1.1, both official or homebrew builds tested—so unlikely to be compiler related) to return an unexpected NA. For example:

rbinom(1,2^32,0.95)

is what I'd expect to work, but gives NA back. However, running with size=2^31 or prob≤0.5 works.

The fine manual mentions inversion being used when size < .Machine$integer.max is false, could this be the issue?

like image 420
Sam Mason Avatar asked Oct 07 '14 16:10

Sam Mason


3 Answers

Looking at the source rbinom does the equivalent (in C code) of the following for such large sizes:

qbinom(runif(n), size, prob, FALSE)

And indeed:

set.seed(42)
rbinom(1,2^31,0.95)
#[1] 2040095619
set.seed(42)
qbinom(runif(1), 2^31, 0.95, F)
#[1] 2040095619

However:

set.seed(42)
rbinom(1,2^32,0.95)
#[1] NA
set.seed(42)
qbinom(runif(1), 2^32, 0.95, F)
#[1] 4080199349

As @BenBolker points out rbinom returns an integer and if the return value is larger than .Machine$integer.max, e.g., larger than 2147483647 on my machine, NA gets returned. In contrast qbinom returns a double. I don't know why and it doesn't seem to be documented.

So, it seems like there is at least undocumented behavior and you should probably report it.

like image 165
Roland Avatar answered Oct 01 '22 00:10

Roland


I agree that (in the absence of documentation saying this is a problem) that this is a bug. A reasonable workaround would be using the Normal approximation, which should be very very good indeed (and faster) for such large values. (I originally meant this to be short and simple but it ended up getting a little bit out of hand.)

rbinom_safe <- function(n,size,prob,max.size=2^31) {
    maxlen <- max(length(size),length(prob),n)
    prob <- rep(prob,length.out=maxlen)
    size <- rep(size,length.out=maxlen)
    res <- numeric(n)
    bigvals <- size>max.size
    if (nbig <- sum(bigvals>0)) {
        m <- (size*prob)[bigvals]
        sd <- sqrt(size*prob*(1-prob))[bigvals]
        res[bigvals] <- round(rnorm(nbig,mean=m,sd=sd))
    }
    if (nbig<n) {
        res[!bigvals] <- rbinom(n-nbig,size[!bigvals],prob[!bigvals])
    }
    return(res)
}

set.seed(101)
size <- c(1,5,10,2^31,2^32)
rbinom_safe(5,size,prob=0.95)
rbinom_safe(5,3,prob=0.95)
rbinom_safe(5,2^32,prob=0.95)

The Normal approximation should work reasonably well whenever the mean is many standard deviations away from 0 or 1 (whichever is closer). For large N this should be OK unless p is very extreme. For example:

n <- 2^31
p <- 0.95
m <- n*p
sd <- sqrt(n*p*(1-p))
set.seed(101)![enter image description here][1]
rr <- rbinom_safe(10000,n,prob=p)
hist(rr,freq=FALSE,col="gray",breaks=50)
curve(dnorm(x,mean=m,sd=sd),col=2,add=TRUE)
dd <- round(seq(m-5*sd,m+5*sd,length.out=101))
midpts <- (dd[-1]+dd[-length(dd)])/2
lines(midpts,c(diff(sapply(dd,pbinom,size=n,prob=p))/diff(dd)[1]),
      col="blue",lty=2)

enter image description here

like image 27
Ben Bolker Avatar answered Sep 30 '22 23:09

Ben Bolker


This is the intended behaviour, but there are two issues: 1) The NA induced by coercion should raise a warning 2) The fact that discrete random variables have storage mode integer should be documented.

I have fixed 1) and will modify the documentation to fix 2) when I have a little more time.

like image 37
Martyn Plummer Avatar answered Oct 01 '22 00:10

Martyn Plummer