It was asked in relation to another question recently: Given an unknown length list, return a random item in it by scanning it only 1 time
I know you shouldn't, I just can't put my finger on a canonical explanation of why not.
Look at the example code:
import random, sys
def rnd(): # a function that returns a random number each call
return int(random.getrandbits(32))
class fixed: # a functor that returns the same random number each call
def __init__(self):
self._ret = rnd()
def __call__(self):
return self._ret
def sample(rnd,seq_size):
choice = 0
for item in xrange(1,seq_size):
if (rnd() % (item+1)) == 0:
choice = item
return choice
dist = [0 for i in xrange(500)]
for i in xrange(1000):
dist[sample(rnd,len(dist))] += 1
print "real",dist
print
dist = [0 for i in xrange(500)]
for i in xrange(1000):
dist[sample(fixed(),len(dist))] += 1
print "reuse",dist
The choices for the proper reservoir sampling that generates a new random number per item is nicely evenly distributed as it should be:
real [1, 3, 0, 1, 2, 3, 2, 3, 1, 2, 2, 2, 2, 0, 0, 1, 3, 3, 4, 0, 2, 1, 2, 1, 1, 4, 0, 3, 1, 1, 2, 0, 0, 0, 1, 4, 6, 2, 3, 1, 1, 3, 2, 1, 3, 3, 1, 4, 1, 1, 2, 2, 5, 1, 2, 1, 0, 3, 1, 0, 2, 6, 1, 2, 2, 1, 1, 1, 1, 3, 2, 1, 5, 4, 0, 3, 3, 4, 0, 0, 2, 1, 3, 2, 3, 0, 2, 4, 6, 3, 0, 1, 3, 0, 2, 2, 4, 3, 2, 1, 2, 1, 2, 2, 1, 4, 2, 0, 0, 1, 1, 0, 1, 4, 2, 2, 2, 1, 0, 3, 1, 2, 1, 0, 2, 2, 1, 5, 1, 5, 3, 3, 1, 0, 2, 2, 0, 3, 2, 3, 0, 1, 1, 3, 0, 1, 2, 2, 0, 1, 2, 2, 3, 2, 3, 1, 1, 0, 1, 2, 2, 2, 2, 2, 3, 2, 1, 2, 2, 2, 1, 3, 3, 1, 0, 1, 1, 0, 1, 3, 2, 1, 4, 3, 4, 1, 1, 1, 2, 1, 2, 0, 0, 0, 1, 1, 2, 6, 0, 1, 1, 0, 1, 0, 1, 2, 2, 3, 0, 1, 2, 2, 1, 0, 4, 2, 1, 2, 2, 0, 4, 4, 0, 3, 2, 2, 1, 2, 4, 1, 2, 1, 0, 2, 1, 1, 5, 1, 2, 2, 3, 2, 3, 0, 1, 2, 3, 2, 5, 2, 3, 0, 1, 1, 1, 1, 3, 4, 2, 4, 1, 2, 3, 2, 5, 2, 1, 0, 1, 1, 2, 2, 3, 1, 1, 1, 2, 1, 2, 0, 4, 1, 1, 2, 3, 4, 3, 1, 2, 3, 3, 3, 2, 1, 2, 0, 0, 4, 3, 2, 2, 5, 5, 3, 3, 3, 1, 0, 1, 3, 1, 1, 2, 4, 3, 1, 4, 4, 2, 5, 0, 5, 4, 2, 1, 0, 4, 1, 3, 3, 2, 4, 2, 3, 3, 1, 3, 3, 4, 2, 2, 1, 1, 1, 1, 3, 3, 5, 3, 2, 4, 0, 1, 3, 2, 2, 4, 2, 2, 3, 4, 5, 3, 2, 1, 2, 3, 2, 2, 2, 4, 4, 0, 1, 3, 3, 3, 4, 1, 2, 4, 0, 4, 0, 3, 2, 1, 1, 4, 2, 1, 0, 0, 0, 4, 2, 2, 1, 4, 3, 1, 1, 3, 2, 4, 3, 4, 2, 1, 1, 2, 2, 3, 3, 1, 2, 2, 1, 1, 2, 3, 1, 9, 1, 3, 4, 2, 4, 4, 0, 1, 0, 1, 0, 2, 1, 0, 1, 2, 3, 3, 6, 2, 2, 1, 2, 4, 3, 3, 3, 2, 1, 2, 1, 2, 8, 2, 3, 1, 5, 3, 0, 2, 1, 1, 4, 2, 2, 1, 2, 3, 2, 1, 0, 4, 3, 4, 3, 1, 3, 2, 3, 2, 2, 1, 0, 1, 2, 5, 3, 0, 3, 1, 2, 2, 2, 1, 0, 1, 4]
Whereas when you re-use the same random number for all items, you get a distribution skewed to the very low numbers:
reuse [92, 50, 34, 19, 23, 16, 13, 9, 9, 9, 11, 10, 6, 7, 8, 5, 5, 6, 4, 2, 2, 3, 2, 3, 3, 6, 6, 1, 4, 3, 5, 2, 2, 1, 1, 2, 3, 4, 3, 4, 1, 3, 1, 0, 0, 1, 5, 3, 1, 2, 0, 2, 0, 1, 1, 6, 2, 0, 2, 2, 4, 2, 2, 0, 2, 2, 2, 0, 3, 0, 4, 1, 2, 1, 4, 2, 2, 0, 1, 0, 1, 1, 0, 0, 0, 2, 0, 0, 2, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 2, 1, 3, 1, 0, 1, 2, 0, 4, 3, 0, 0, 2, 0, 0, 1, 0, 0, 2, 0, 2, 1, 0, 1, 0, 0, 1, 1, 3, 0, 1, 1, 0, 2, 0, 1, 2, 0, 1, 1, 4, 1, 1, 1, 2, 1, 0, 1, 2, 0, 2, 1, 1, 2, 0, 1, 1, 0, 2, 0, 2, 0, 0, 2, 0, 1, 0, 2, 1, 1, 0, 0, 1, 2, 4, 1, 0, 2, 0, 1, 2, 1, 3, 0, 1, 0, 0, 1, 0, 0, 2, 1, 0, 0, 0, 3, 2, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 4, 1, 0, 2, 1, 0, 0, 2, 1, 1, 3, 3, 2, 0, 1, 0, 2, 0, 1, 1, 0, 0, 3, 1, 0, 0, 1, 0, 3, 2, 2, 0, 0, 0, 0, 0, 2, 0, 1, 0, 2, 0, 4, 1, 0, 0, 2, 0, 1, 1, 0, 0, 3, 1, 3, 2, 2, 1, 3, 1, 2, 0, 1, 1, 3, 0, 3, 1, 2, 0, 2, 0, 2, 0, 3, 0, 3, 0, 3, 1, 0, 2, 3, 1, 1, 0, 1, 3, 3, 1, 1, 1, 0, 2, 1, 1, 4, 1, 1, 1, 2, 0, 3, 1, 1, 0, 4, 1, 1, 0, 1, 3, 1, 0, 1, 1, 0, 3, 3, 0, 2, 4, 0, 1, 2, 1, 6, 1, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 4, 2, 0, 1, 2, 0, 1, 4, 1, 2, 0, 5, 2, 2, 0, 6, 2, 2, 1, 3, 0, 3, 1, 1, 0, 3, 1, 4, 2, 0, 1, 0, 1, 2, 3, 1, 1, 3, 0, 0, 0, 1, 1, 4, 3, 3, 0, 0, 1, 0, 1, 1, 2, 1, 0, 2, 1, 4, 5, 1, 1, 3, 0, 1, 1, 1, 3, 1, 1, 0, 3, 3, 1, 3, 0, 1, 0, 0, 1, 1, 3, 2, 1, 0, 3, 1, 1, 3, 1, 3, 1, 2, 2, 2, 0, 0, 5, 1, 3, 0, 1, 4, 1, 1, 1, 3, 2, 1, 3, 2, 1, 3, 1, 2, 2, 3, 2, 2, 1, 0, 3, 3, 1, 3, 3, 3, 2, 1, 2, 3, 3, 3, 1, 2, 2, 2, 4, 2, 1, 5, 2, 2, 0]
What's the maths here? Why can't you re-use the same random number?
Edit, in response to comment:
The way reservoir sampling should work is: you want to select exactly the right proportion of samples from each of the existing bins in order to make up an additional bin with the same probability. In your sample()
loop, given that you have randomly sampled one of item
bins, you need to select samples from each bin with probability 1/(item + 1)
.
However, with fixed()
, both the selection decision and the previous bin number depend on the same fixed 32-bit number. This means that the likelihood that a sample is removed from each of the bins will not be uniform.
Consider what happens during the third iteration of the sample()
loop. You have three existing bins (0, 1, and 2), and you want to pick 1/4 of the samples in each and add them to a newly created bin 3.
Note that all the 32-bit fixed()
numbers in bin 1 will be even (because the first pass selected all numbers divisible by 2), and all the numbers in bin 0 will be odd (because the even ones were moved to bin 1). The second pass moves all numbers divisible by three to bin 2 (which should be OK so far, and does not change the even/odd division in bins 0 and 1).
The third pass then moves all fixed()
numbers divisible by 4 into bin 3. But, this will select half the numbers from bin 1 (because half of all even numbers are divisible by 4), and none of the numbers from bin 0 (because they are all odd). So, even though the expected size of the new bin should be correct, the expected sizes of the old bins are no longer the same.
That is how fixed()
generates an uneven distribution: the implicit assumption, that you can select an exact fraction of each bin by choosing a random number, is violated if that number depends in a predictable way on the numbers used to choose the original bin.
The basic property of random numbers is that each sample must be independently distributed from the preceding samples, in a statistical sense. Algorithms based on random numbers depend on this property.
Pseudo-random number generators (PRNG's) are not actually random; as you know, their results are actually a fixed sequence. The PRNG results are deliberately scrambled so that they act enough like actual random numbers for most purposes. However, if the PRNG is "weak" for a particular application, the inner workings of the PRNG can interact with the details of the application in odd ways, to very non-random results.
What you're trying out here, by re-using the same random number, is building a bad PRNG. The actual results depend on the details of how the application uses the random numbers...
Even though fixed()
is an intentionally broken PRNG, many commercial library PRNG's are "weak", and can end up with similar weird interactions with some applications. As a practical matter, "weakness" is relative to the application -- and, while there are statistical tests that are widely used to try to expose weak PRNG's, there is no guarantee that your application won't stumble on some odd correlation of even a "strong" PRNG.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With