Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can `rand()` in c++ be used to generate unbiased bools?

I have written the following function

bool random_bool(double probability)
{
    double p_scaled = probability * (RAND_MAX+1) - rand();
    if ( p_scaled >= 1 ) return true;
    if ( p_scaled <= 0 ) return false;
    return random_bool( p_scaled );
}

Given, that rand() generates a number from uniform distribution on {0,1,...,RAND_MAX-1,RAND_MAX} and numbers from subsequent calls can be treated as independent for all practical purposes except cryptography, this should return true with probability p: two if statements return true with probability slightly below p, and false with the probability slightly above 1-p, while the recursive call deals with everything else.

However the following test fails:

long long N = 10000000000; //1e10
double p = 10000.0 / N;
int counter = 0;
for (long long i=0;i<N;i++) if (random_bool(p)) counter++;
assert(9672 < counter && counter <= 10330);

The assert statement is designed to fail only in 0.1% of cases. However it fails all the time (with counter being between 10600 and 10700).

What's wrong?

P.S.: I've seen this question, but it doesn't help...

like image 748
fiktor Avatar asked Oct 20 '22 15:10

fiktor


1 Answers

One common defect in random number generators is a slight bias towards smaller results (basically a slight bias towards 0 in high order bits). This often happens when wrapping the RNG internal state to the output range is done using a simple mod, which is biased against high values unless RAND_MAX is a divisor of the size of the internal state. Here's a typical biased mapping implementation:

static unsigned int state;

int rand() {
   state = nextState(); /* this actually moves the state from one random value to the next, eg., using a LCG */
   return state % RAND_MAX;  /* biased */
}

The bias occurs because lower values output an have one more mapping under mod from the state. E.g., if the state can have values 0-9 (10 values), and RAND_MAX is 3 (so values 0-2), then the % 3 operation results in, depending on the state

Output  State
0       0 3 6 9 
1       1 4 7
2       2 5 8

The result 0 is over-represented because it has a 4/10 chance of being selected, vs 3/10 for the other values.

As an example with more likely values, if the internal RNG state is a 16-integer, and RAND_MAX is 35767 (as you mentioned it is on your platform), then all the values [0,6000] will be be output for 3 different state values, but the remaining ~30,000 values will only be output for 2 distinct state values - a significant bias. This kind of bias would tend to cause your counter value to be higher than expected (since smaller than uniform returns from rand() favors the p_scaled >= 1 condition.

It would help if you could post the exact implementation of rand() on your platform. If it turns out to be bias in the high bits, you may be able to eliminate this by passing the values you get from rand() through a good hash function, but a better approach is probably just to use a high quality source of random numbers, e.g., the Mersenne Twister . A better generator will also have a larger output range (effective, a higher RAND_MAX), which means your algorithm will suffer fewer retries/less recursion.

Even if the Visual Studio runtime implementation suffers from this defect, it is worth noting that it was probably at least partly an intentional design choice - using a RAND_MAX like 35767 that is relatively prime to the state size (typically a power of 2), ensures better randomness of the lower bits, since the % operation effectively mixes the high and low order bits - and having biased/non-random low order bits is often a bigger problem in practice than a slight bias in the high order bits because of the ubiquity of the caller of rand() reducing the range using %, which effectively uses only the low order bits for moduli which are powers of 2 (also very common).

like image 154
BeeOnRope Avatar answered Oct 30 '22 02:10

BeeOnRope