Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

re-using random number in reservoir sampling

It was asked in relation to another question recently: Given an unknown length list, return a random item in it by scanning it only 1 time

I know you shouldn't, I just can't put my finger on a canonical explanation of why not.

Look at the example code:

import random, sys

def rnd(): # a function that returns a random number each call
    return int(random.getrandbits(32))

class fixed: # a functor that returns the same random number each call
    def __init__(self):
        self._ret = rnd()
    def __call__(self):
        return self._ret

def sample(rnd,seq_size):
    choice = 0
    for item in xrange(1,seq_size):
        if (rnd() % (item+1)) == 0:
            choice = item
    return choice

dist = [0 for i in xrange(500)]
for i in xrange(1000):
    dist[sample(rnd,len(dist))] += 1
print "real",dist
print

dist = [0 for i in xrange(500)]
for i in xrange(1000):
    dist[sample(fixed(),len(dist))] += 1
print "reuse",dist

The choices for the proper reservoir sampling that generates a new random number per item is nicely evenly distributed as it should be:

real [1, 3, 0, 1, 2, 3, 2, 3, 1, 2, 2, 2, 2, 0, 0, 1, 3, 3, 4, 0, 2, 1, 2, 1, 1, 4, 0, 3, 1, 1, 2, 0, 0, 0, 1, 4, 6, 2, 3, 1, 1, 3, 2, 1, 3, 3, 1, 4, 1, 1, 2, 2, 5, 1, 2, 1, 0, 3, 1, 0, 2, 6, 1, 2, 2, 1, 1, 1, 1, 3, 2, 1, 5, 4, 0, 3, 3, 4, 0, 0, 2, 1, 3, 2, 3, 0, 2, 4, 6, 3, 0, 1, 3, 0, 2, 2, 4, 3, 2, 1, 2, 1, 2, 2, 1, 4, 2, 0, 0, 1, 1, 0, 1, 4, 2, 2, 2, 1, 0, 3, 1, 2, 1, 0, 2, 2, 1, 5, 1, 5, 3, 3, 1, 0, 2, 2, 0, 3, 2, 3, 0, 1, 1, 3, 0, 1, 2, 2, 0, 1, 2, 2, 3, 2, 3, 1, 1, 0, 1, 2, 2, 2, 2, 2, 3, 2, 1, 2, 2, 2, 1, 3, 3, 1, 0, 1, 1, 0, 1, 3, 2, 1, 4, 3, 4, 1, 1, 1, 2, 1, 2, 0, 0, 0, 1, 1, 2, 6, 0, 1, 1, 0, 1, 0, 1, 2, 2, 3, 0, 1, 2, 2, 1, 0, 4, 2, 1, 2, 2, 0, 4, 4, 0, 3, 2, 2, 1, 2, 4, 1, 2, 1, 0, 2, 1, 1, 5, 1, 2, 2, 3, 2, 3, 0, 1, 2, 3, 2, 5, 2, 3, 0, 1, 1, 1, 1, 3, 4, 2, 4, 1, 2, 3, 2, 5, 2, 1, 0, 1, 1, 2, 2, 3, 1, 1, 1, 2, 1, 2, 0, 4, 1, 1, 2, 3, 4, 3, 1, 2, 3, 3, 3, 2, 1, 2, 0, 0, 4, 3, 2, 2, 5, 5, 3, 3, 3, 1, 0, 1, 3, 1, 1, 2, 4, 3, 1, 4, 4, 2, 5, 0, 5, 4, 2, 1, 0, 4, 1, 3, 3, 2, 4, 2, 3, 3, 1, 3, 3, 4, 2, 2, 1, 1, 1, 1, 3, 3, 5, 3, 2, 4, 0, 1, 3, 2, 2, 4, 2, 2, 3, 4, 5, 3, 2, 1, 2, 3, 2, 2, 2, 4, 4, 0, 1, 3, 3, 3, 4, 1, 2, 4, 0, 4, 0, 3, 2, 1, 1, 4, 2, 1, 0, 0, 0, 4, 2, 2, 1, 4, 3, 1, 1, 3, 2, 4, 3, 4, 2, 1, 1, 2, 2, 3, 3, 1, 2, 2, 1, 1, 2, 3, 1, 9, 1, 3, 4, 2, 4, 4, 0, 1, 0, 1, 0, 2, 1, 0, 1, 2, 3, 3, 6, 2, 2, 1, 2, 4, 3, 3, 3, 2, 1, 2, 1, 2, 8, 2, 3, 1, 5, 3, 0, 2, 1, 1, 4, 2, 2, 1, 2, 3, 2, 1, 0, 4, 3, 4, 3, 1, 3, 2, 3, 2, 2, 1, 0, 1, 2, 5, 3, 0, 3, 1, 2, 2, 2, 1, 0, 1, 4]

Whereas when you re-use the same random number for all items, you get a distribution skewed to the very low numbers:

reuse [92, 50, 34, 19, 23, 16, 13, 9, 9, 9, 11, 10, 6, 7, 8, 5, 5, 6, 4, 2, 2, 3, 2, 3, 3, 6, 6, 1, 4, 3, 5, 2, 2, 1, 1, 2, 3, 4, 3, 4, 1, 3, 1, 0, 0, 1, 5, 3, 1, 2, 0, 2, 0, 1, 1, 6, 2, 0, 2, 2, 4, 2, 2, 0, 2, 2, 2, 0, 3, 0, 4, 1, 2, 1, 4, 2, 2, 0, 1, 0, 1, 1, 0, 0, 0, 2, 0, 0, 2, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 2, 1, 3, 1, 0, 1, 2, 0, 4, 3, 0, 0, 2, 0, 0, 1, 0, 0, 2, 0, 2, 1, 0, 1, 0, 0, 1, 1, 3, 0, 1, 1, 0, 2, 0, 1, 2, 0, 1, 1, 4, 1, 1, 1, 2, 1, 0, 1, 2, 0, 2, 1, 1, 2, 0, 1, 1, 0, 2, 0, 2, 0, 0, 2, 0, 1, 0, 2, 1, 1, 0, 0, 1, 2, 4, 1, 0, 2, 0, 1, 2, 1, 3, 0, 1, 0, 0, 1, 0, 0, 2, 1, 0, 0, 0, 3, 2, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 4, 1, 0, 2, 1, 0, 0, 2, 1, 1, 3, 3, 2, 0, 1, 0, 2, 0, 1, 1, 0, 0, 3, 1, 0, 0, 1, 0, 3, 2, 2, 0, 0, 0, 0, 0, 2, 0, 1, 0, 2, 0, 4, 1, 0, 0, 2, 0, 1, 1, 0, 0, 3, 1, 3, 2, 2, 1, 3, 1, 2, 0, 1, 1, 3, 0, 3, 1, 2, 0, 2, 0, 2, 0, 3, 0, 3, 0, 3, 1, 0, 2, 3, 1, 1, 0, 1, 3, 3, 1, 1, 1, 0, 2, 1, 1, 4, 1, 1, 1, 2, 0, 3, 1, 1, 0, 4, 1, 1, 0, 1, 3, 1, 0, 1, 1, 0, 3, 3, 0, 2, 4, 0, 1, 2, 1, 6, 1, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 4, 2, 0, 1, 2, 0, 1, 4, 1, 2, 0, 5, 2, 2, 0, 6, 2, 2, 1, 3, 0, 3, 1, 1, 0, 3, 1, 4, 2, 0, 1, 0, 1, 2, 3, 1, 1, 3, 0, 0, 0, 1, 1, 4, 3, 3, 0, 0, 1, 0, 1, 1, 2, 1, 0, 2, 1, 4, 5, 1, 1, 3, 0, 1, 1, 1, 3, 1, 1, 0, 3, 3, 1, 3, 0, 1, 0, 0, 1, 1, 3, 2, 1, 0, 3, 1, 1, 3, 1, 3, 1, 2, 2, 2, 0, 0, 5, 1, 3, 0, 1, 4, 1, 1, 1, 3, 2, 1, 3, 2, 1, 3, 1, 2, 2, 3, 2, 2, 1, 0, 3, 3, 1, 3, 3, 3, 2, 1, 2, 3, 3, 3, 1, 2, 2, 2, 4, 2, 1, 5, 2, 2, 0]

What's the maths here? Why can't you re-use the same random number?

like image 880
Will Avatar asked Jan 02 '12 16:01

Will


1 Answers

Edit, in response to comment:

The way reservoir sampling should work is: you want to select exactly the right proportion of samples from each of the existing bins in order to make up an additional bin with the same probability. In your sample() loop, given that you have randomly sampled one of item bins, you need to select samples from each bin with probability 1/(item + 1).

However, with fixed(), both the selection decision and the previous bin number depend on the same fixed 32-bit number. This means that the likelihood that a sample is removed from each of the bins will not be uniform.


Consider what happens during the third iteration of the sample() loop. You have three existing bins (0, 1, and 2), and you want to pick 1/4 of the samples in each and add them to a newly created bin 3.

Note that all the 32-bit fixed() numbers in bin 1 will be even (because the first pass selected all numbers divisible by 2), and all the numbers in bin 0 will be odd (because the even ones were moved to bin 1). The second pass moves all numbers divisible by three to bin 2 (which should be OK so far, and does not change the even/odd division in bins 0 and 1).

The third pass then moves all fixed() numbers divisible by 4 into bin 3. But, this will select half the numbers from bin 1 (because half of all even numbers are divisible by 4), and none of the numbers from bin 0 (because they are all odd). So, even though the expected size of the new bin should be correct, the expected sizes of the old bins are no longer the same.

That is how fixed() generates an uneven distribution: the implicit assumption, that you can select an exact fraction of each bin by choosing a random number, is violated if that number depends in a predictable way on the numbers used to choose the original bin.


The basic property of random numbers is that each sample must be independently distributed from the preceding samples, in a statistical sense. Algorithms based on random numbers depend on this property.

Pseudo-random number generators (PRNG's) are not actually random; as you know, their results are actually a fixed sequence. The PRNG results are deliberately scrambled so that they act enough like actual random numbers for most purposes. However, if the PRNG is "weak" for a particular application, the inner workings of the PRNG can interact with the details of the application in odd ways, to very non-random results.

What you're trying out here, by re-using the same random number, is building a bad PRNG. The actual results depend on the details of how the application uses the random numbers...

Even though fixed() is an intentionally broken PRNG, many commercial library PRNG's are "weak", and can end up with similar weird interactions with some applications. As a practical matter, "weakness" is relative to the application -- and, while there are statistical tests that are widely used to try to expose weak PRNG's, there is no guarantee that your application won't stumble on some odd correlation of even a "strong" PRNG.

like image 145
comingstorm Avatar answered Oct 13 '22 07:10

comingstorm