I'm fairly well acquainted with using Reservoir Sampling to sample from a set of undetermined length in a single pass over the data. One limitation of this approach, in my mind, is that it still requires a pass over the entire data set before any results can be returned. Conceptually this makes sense, since one has to allow items in the entirety of the sequence the opportunity to replace previously encountered items to achieve a uniform sample.
Is there a way to be able to yield some random results before the entire sequence has been evaluated? I'm thinking of the kind of lazy approach that would fit well with python's great itertools library. Perhaps this could be done within some given error tolerance? I'd appreciate any sort of feedback on this idea!
Just to clarify the question a bit, this diagram sums up my understanding of the in-memory vs. streaming tradeoffs of different sampling techniques. What I want is something that falls into the category of Stream Sampling, where we don't know the length of the population beforehand.
Clearly there is a seeming contradiction in not knowing the length a priori and still getting a uniform sample, since we will most likely bias the sample towards the beginning of the population. Is there a way to quantify this bias? Are there tradeoffs to be made? Does anybody have a clever algorithm to solve this problem?
If you know in advance the total number of items that will be yielded by an iterable population
, it is possible to yield the items of a sample of population
as you come to them (not only after reaching the end). If you don't know the population size ahead of time, this is impossible (as the probability of any item being in the sample can't be be calculated).
Here's a quick generator that does this:
def sample_given_size(population, population_size, sample_size):
for item in population:
if random.random() < sample_size / population_size:
yield item
sample_size -= 1
population_size -= 1
Note that the generator yields items in the order they appear in the population (not in random order, like random.sample
or most reservoir sampling codes), so a slice of the sample will not be a random subsample!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With