Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Eliminating modulo bias: how is it achieved in the arc4random_uniform() function?

Modulo bias is a problem that arises when naively using the modulo operation to get pseudorandom numbers smaller than a given "upper bound".

Therefore as a C programmer I am using a modified version of the arc4random_uniform() function to generate evenly distributed pseudorandom numbers.

The problem is I do not understand how the function works, mathematically.

This is the function's explanatory comment, followed by a link to the full source code:

/*
 * Calculate a uniformly distributed random number less than upper_bound
 * avoiding "modulo bias".
 *
 * Uniformity is achieved by generating new random numbers until the one
 * returned is outside the range [0, 2**32 % upper_bound).  This
 * guarantees the selected random number will be inside
 * [2**32 % upper_bound, 2**32) which maps back to [0, upper_bound)
 * after reduction modulo upper_bound.
 */

http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/lib/libc/crypt/arc4random_uniform.c?rev=1.1&content-type=text/x-cvsweb-markup

From the comment above we can define:

  • [2^32 % upper_bound, 2^32) - interval A
  • [0, upper_bound) - interval B

In order to work, the function relies on the fact that interval A maps to interval B.

My question is: mathematically, how come the numbers in interval A map uniformly to the ones in interval B? And is there a proof for this?

like image 954
Log Avatar asked Aug 27 '15 20:08

Log


1 Answers

Sometimes it helps to start with an easily understood example, and then generalize from there. To keep things simple, let's imagine that arc4random returns a uint8_t instead of a uint32_t, so the output from arc4random is a number in the interval [0,256). And let's choose an upper_bound of 7.

Note that 7 does not divide evenly into 256

256 = 7 * 36 + 4

That means that naively using the modulo operation to get pseudorandom numbers smaller than 7 would result in the following probability distribution

37/256 for outcomes 0,1,2,3
36/256 for outcomes 4,5,6

That's what's known as modulo bias, outcomes 0,1,2,3 are more likely than outcomes 4,5,6.

To avoid modulo bias we could simply reject the values 252,253,254,255, and generate a new number until the result is in the interval [0,252). All the numbers in the interval [0,252) have equal probability (rejecting higher numbers doesn't effect the distribution of the lower numbers). And since 7 divides evenly into 252, the resulting probability distribution is uniform

 36/252 for outcomes 0,1,2,3,4,5,6,7

That's essentially what arc4random_uniform does, except that arc4random_uniform rejects numbers at the bottom of the range. Specifically, interval A would be

[2^8 % 7, 2^8) which is [4, 256)

After generating a number (call it N) in the interval [4,256) the final calculation is

outcome = N % 7

There are 252 numbers in the interval [4,256), and since 252 is a multiple of 7, every outcome on the interval [0,7) has equal probability.


That's how arc4random_uniform works, it rejects/retries on a small range of numbers, and the count of numbers in the remaining range is a multiple of the upper_bound. (Since the upper_bound is typically a small number compared to 2^32, the odds of having multiple retries for a single outcome is quite small.)

But do you really care about modulo bias? In most cases, the answer is, "No". Consider our example with an upper bound of 7. The probability distribution for the naive modulo implementation is

613566757 / 4294967296 for outcomes 0,1,2,3
613566756 / 4294967296 for outcomes 4,5,6

which is a modulo bias of less than 0.0000002%.

So there's your choice: either spend a minuscule amount of time on retries to get a perfect distribution, or accept a minuscule error in the probability distribution to avoid the retries.

like image 120
user3386109 Avatar answered Nov 15 '22 06:11

user3386109