 

Uniform distribution of integers using floating point source

The standard way to get a random integer in the range [0, n) in JavaScript - or any other language that only offers a random() function that returns a float in the range [0,1) - is to use Math.floor(Math.random() * n).

Now the math behind this is trivial if we assume we're operating on the rational numbers. The question is: with all the complications of IEEE-754 floating point numbers, is the resulting distribution actually uniform?

Considering that the gap between one floating point number and the next higher one increases as they grow larger I would think that this should introduce some kind of bias towards smaller numbers.

Voo asked Sep 02 '15

2 Answers

No, the resulting distribution is not going to be perfectly uniform, for most values of n. For small values, it'll be so close to uniform that you'd have a hard time detecting any difference from a uniform distribution, but as n gets larger the bias can become noticeable.

To illustrate, here's some Python code (not JavaScript, sorry, but the principle is the same):

from collections import Counter
from random import random

def badrand(n):
    # the idiom in question: scale a random float in [0, 1) and truncate
    return int(random() * n)

# 6755399441055744 is exactly 3/4 * 2**53; reduce each of 10 million
# draws modulo 3 and count how often each remainder appears
print(Counter(badrand(6755399441055744) % 3 for _ in range(10000000)))

This is producing 10 million random integers in the range [0, 6755399441055744), reducing each of those integers modulo 3, and counting the number of times the remainder is 0, 1, or 2. If we're generating those integers uniformly, we'd expect the remainders modulo 3 to be roughly evenly distributed, so we'd expect the counts to be similar.

Here's an example result from running this on my machine:

Counter({1: 3751915, 0: 3334643, 2: 2913442})

That is, a remainder of 1 is significantly more likely to occur than 0, which in turn is significantly more likely to occur than a remainder of 2. The differences here are way too big to be explained by random variation.

So what went wrong? Python's random() function is relatively high quality, based on the Mersenne Twister, so we're unlikely to be seeing statistical problems from the base random number generator. What's happening is that random() generates one of 2^53 (roughly) equally likely outcomes - each outcome is a number of the form x / 2^53 for some integer x in the range [0, 2^53).

In the badrand call, we're then mapping those 2^53 outcomes onto 6755399441055744 possible outputs. That value wasn't chosen at random (ha!); it's exactly 3/4 of 2^53. That means that under the most uniform distribution possible, 2/3 of the possible badrand output values are hit by exactly one of the 2^53 possible random() output values, while the other 1/3 are hit by two of them. That is, some of the potential outputs are twice as likely to occur as others - a long way from uniform.
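To see where the 2/3-vs-1/3 split comes from without enumerating all 2^53 inputs, here's a scaled-down sketch: a toy 16-bit source stands in for the 53-bit one, and the mapping is computed in exact integer arithmetic (so any extra floating-point rounding in the real multiply is ignored):

from collections import Counter

w = 16                 # toy precision standing in for the 53 bits of a double
states = 2 ** w        # number of equally likely source outcomes
n = 3 * states // 4    # exactly 3/4 of 2^w, mirroring the choice above

# multiplicity of each output under the ideal mapping floor(x * n / 2^w)
hits = Counter(x * n // states for x in range(states))
print(Counter(hits.values()))  # -> Counter({1: 32768, 2: 16384}): 2/3 once, 1/3 twice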

You're going to see the same effect in JavaScript. In the case of Chrome, it appears that there are only 2^32 distinct results from Math.random(), so you should be able to find effects like the above with n smaller than (but close to) 2^32.
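To see that effect without waiting for a browser, here's a sketch that models such a source in Python (a toy model with exactly 2^32 equally likely outputs, not Chrome's actual generator), with n again chosen as 3/4 of the state count:

from collections import Counter
from random import getrandbits

def rand32():
    # model of a Math.random() with only 2^32 equally likely outputs
    return getrandbits(32) / 2 ** 32

n = 3 * 2 ** 32 // 4  # 3/4 of 2^32
print(Counter(int(rand32() * n) % 3 for _ in range(10000000)))

For this n the product x/2^32 * n works out exactly in double precision, and the outputs hit twice are exactly the multiples of 3, so a remainder of 0 should turn up about half the time - twice as often as each of the other two remainders.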

Of course, the same effect exists for small n, too: if n = 5, then because 5 is not a divisor of 2^32 there's no way to distribute all 2^32 possible Math.random() results perfectly evenly among the 5 desired outcomes: the best we can hope for is that 4 of the 5 outcomes each arise from 858993459 of the possible random() results, while the fifth arises from 858993460 of them. But that distribution is so close to uniform that no practical statistical test could tell the difference. So for practical purposes, you should be safe with small n.
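Those counts can be checked without enumerating all 2^32 source values: under the ideal mapping floor(x * n / 2^32), the number of x values hitting output k is a difference of two ceilings. A small sketch (hits is a hypothetical helper, not a library function):

def hits(k, states, n):
    # number of source values x in [0, states) with floor(x * n / states) == k:
    # x runs from ceil(k*states/n) up to ceil((k+1)*states/n) - 1
    lo = -(-k * states // n)          # ceil(k * states / n)
    hi = -(-(k + 1) * states // n)    # ceil((k+1) * states / n)
    return hi - lo

print([hits(k, 2 ** 32, 5) for k in range(5)])
# -> [858993460, 858993459, 858993459, 858993459, 858993459]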

There's a related Python bug that might be interesting at http://bugs.python.org/issue9025. That bug was solved for Python 3 by moving away from the int(random() * n) method of computing these numbers. The bug still remains in Python 2, though.
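The Python 3 approach works on integers from the start instead of scaling a float: draw as many random bits as n needs, and reject any draw that falls outside the range. A minimal sketch of that idea (similar in spirit to what random.Random._randbelow does internally, not its exact code):

from random import getrandbits

def randbelow(n):
    # draw bit_length(n) random bits; retry until the result is < n.
    # every accepted value is equally likely, so the result is exactly
    # uniform on [0, n), at the cost of an occasional retry
    k = n.bit_length()
    r = getrandbits(k)
    while r >= n:
        r = getrandbits(k)
    return r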

Mark Dickinson answered Oct 21 '22


If Math.random (or equivalent) generated a uniformly distributed bit pattern out of those bit patterns corresponding to floating point numbers in the range [0, 1), then it would produce an extremely biased sample. There are as many representable floating point numbers in [0.25, 0.5) as there are in [0.5, 1.0), which is also the same as the number of representable values in [0.125, 0.25). And so on. In short, uniformly distributed bit patterns would result in only about one out of a thousand values lying between 0.5 and 1.0 (assuming double-precision floating point numbers).
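That count is easy to verify, because for positive doubles adjacent values have adjacent bit patterns, so subtracting bit patterns counts representable values. A quick sketch (ordinal is a hypothetical helper name):

import struct

def ordinal(x):
    # bit pattern of a non-negative double, read as a 64-bit integer;
    # adjacent doubles have adjacent patterns, so differences count values
    return struct.unpack('<Q', struct.pack('<d', x))[0]

print(ordinal(0.5) - ordinal(0.25) == ordinal(1.0) - ordinal(0.5))     # True
print((ordinal(1.0) - ordinal(0.0)) // (ordinal(1.0) - ordinal(0.5)))  # 1023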

Fortunately, that's not what Math.random does. One simple way of getting a uniformly distributed number (rather than a uniformly distributed bit pattern) is to generate a random bit pattern whose value lies in [1.0, 2.0) - where the representable doubles are evenly spaced - and then subtract 1.0; that's a fairly common strategy.
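Here's one way that strategy can be realized, sketched in Python with explicit bit manipulation (an illustration of the idea, not what any particular engine actually does):

import struct
from random import getrandbits

def random_unit():
    # set the exponent field to the biased value 1023, so the double lies
    # in [1.0, 2.0), and fill the 52 mantissa bits uniformly at random
    bits = (1023 << 52) | getrandbits(52)
    # reinterpret the 64-bit pattern as a double, then shift to [0.0, 1.0)
    return struct.unpack('<d', struct.pack('<Q', bits))[0] - 1.0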

Regardless, the end result of Math.floor(Math.random() * n) is not quite uniformly distributed unless n is a power of 2, because of quantization bias. The number of possible floating point values which could be returned by Math.random is a power of 2, and if n is not a power of 2, then it is impossible to distribute those possible floating point values precisely evenly over the integers in [0, n). If Math.random returns a double-precision floating point number and n is not huge, this bias is small, but it certainly exists.

rici answered Oct 21 '22