Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Iterative or Lazy Reservoir Sampling

I'm fairly well acquainted with using Reservoir Sampling to sample from a set of undetermined length in a single pass over the data. One limitation of this approach, in my mind, is that it still requires a pass over the entire data set before any results can be returned. Conceptually this makes sense, since one has to allow items in the entirety of the sequence the opportunity to replace previously encountered items to achieve a uniform sample.

Is there a way to be able to yield some random results before the entire sequence has been evaluated? I'm thinking of the kind of lazy approach that would fit well with python's great itertools library. Perhaps this could be done within some given error tolerance? I'd appreciate any sort of feedback on this idea!

Just to clarify the question a bit, this diagram sums up my understanding of the in-memory vs. streaming tradeoffs of different sampling techniques. What I want is something that falls into the category of Stream Sampling, where we don't know the length of the population beforehand.

enter image description here

Clearly there is a seeming contradiction in not knowing the length a priori and still getting a uniform sample, since we will most likely bias the sample towards the beginning of the population. Is there a way to quantify this bias? Are there tradeoffs to be made? Does anybody have a clever algorithm to solve this problem?

like image 242
Stankalank Avatar asked Jun 11 '14 21:06

Stankalank


1 Answers

If you know in advance the total number of items that will be yielded by an iterable population, it is possible to yield the items of a sample of population as you come to them (not only after reaching the end). If you don't know the population size ahead of time, this is impossible (as the probability of any item being in the sample can't be be calculated).

Here's a quick generator that does this:

def sample_given_size(population, population_size, sample_size):
    for item in population:
        if random.random() < sample_size / population_size:
            yield item
            sample_size -= 1
        population_size -= 1

Note that the generator yields items in the order they appear in the population (not in random order, like random.sample or most reservoir sampling codes), so a slice of the sample will not be a random subsample!

like image 54
Blckknght Avatar answered Oct 12 '22 23:10

Blckknght