I have seen this question asked a lot but never seen a true concrete answer to it. So I am going to post one here which will hopefully help people understand why exactly there is "modulo bias" when using a random number generator, like rand()
in C++.
Instead, random-number generators that are often used to produce such sequences tend to cluster zeros together, introducing a bias.
With some random number generators, it's possible to select the seed carefully to manipulate the output. Sometimes this is easy to do. Sometimes it's hard but doable. Sometimes it's theoretically possible but practically impossible.
Surprisingly, the general-purpose random number generators that are in most widespread use are easily predicted. (In contrast RNGs used to construct stream ciphers for secure communication are believed to be infeasible to predict, and are known as cryptographically secure).
Random numbers are important for computer encryption, lotteries, scientific modelling, and gambling. Current methods of generating random numbers can produce predictable results. Researchers said the new method could generate higher-quality random numbers with less computer processing.
So rand()
is a pseudo-random number generator which chooses a natural number between 0 and RAND_MAX
, which is a constant defined in cstdlib
(see this article for a general overview on rand()
).
Now what happens if you want to generate a random number between say 0 and 2? For the sake of explanation, let's say RAND_MAX
is 10 and I decide to generate a random number between 0 and 2 by calling rand()%3
. However, rand()%3
does not produce the numbers between 0 and 2 with equal probability!
When rand()
returns 0, 3, 6, or 9, rand()%3 == 0
. Therefore, P(0) = 4/11
When rand()
returns 1, 4, 7, or 10, rand()%3 == 1
. Therefore, P(1) = 4/11
When rand()
returns 2, 5, or 8, rand()%3 == 2
. Therefore, P(2) = 3/11
This does not generate the numbers between 0 and 2 with equal probability. Of course for small ranges this might not be the biggest issue but for a larger range this could skew the distribution, biasing the smaller numbers.
So when does rand()%n
return a range of numbers from 0 to n-1 with equal probability? When RAND_MAX%n == n - 1
. In this case, along with our earlier assumption rand()
does return a number between 0 and RAND_MAX
with equal probability, the modulo classes of n would also be equally distributed.
So how do we solve this problem? A crude way is to keep generating random numbers until you get a number in your desired range:
int x; do { x = rand(); } while (x >= n);
but that's inefficient for low values of n
, since you only have a n/RAND_MAX
chance of getting a value in your range, and so you'll need to perform RAND_MAX/n
calls to rand()
on average.
A more efficient formula approach would be to take some large range with a length divisible by n
, like RAND_MAX - RAND_MAX % n
, keep generating random numbers until you get one that lies in the range, and then take the modulus:
int x; do { x = rand(); } while (x >= (RAND_MAX - RAND_MAX % n)); x %= n;
For small values of n
, this will rarely require more than one call to rand()
.
Works cited and further reading:
CPlusPlus Reference
Eternally Confuzzled
Keep selecting a random is a good way to remove the bias.
Update
We could make the code fast if we search for an x in range divisible by n
.
// Assumptions // rand() in [0, RAND_MAX] // n in (0, RAND_MAX] int x; // Keep searching for an x in a range divisible by n do { x = rand(); } while (x >= RAND_MAX - (RAND_MAX % n)) x %= n;
The above loop should be very fast, say 1 iteration on average.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With