I have a stream of events flowing through my servers. It is not feasible for me to store all of them, but I would like to periodically be able to process some of them in aggregate. So, I want to keep a subset of the stream that is a random sampling of everything I've seen, but is capped to a max size.
So, for each new item, I need an algorithm to decide if I should add it to the stored set, or if I should discard it. If I add it, and I'm already at my limit, I need an algorithm to evict one of the old items.
Obviously, this is easy as long as I'm below my limit (just save everything). But how can I maintain a good random sampling without being biased towards old items or new items once I'm past that limit?
Thanks,
This is a common interview question.
One easy way to do it: when the nth element arrives, save it with probability k/n (or 1, whichever is smaller), where k is the maximum sample size you want to keep and n is the number of elements seen so far. If you need to make room for the new sample, evict a uniformly random element from the stored set.
This gives you a uniformly random subset of the n elements seen so far. If you don't know n exactly, you can estimate it and get an approximately uniform subset.
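A minimal per-item sketch of that rule (Python; the function name and parameters are just illustrative, not a standard API):

import random

def maybe_keep(item, n, reservoir, k):
    # n = number of items seen so far (including this one), k = max sample size
    if len(reservoir) < k:
        reservoir.append(item)                     # below the cap: always keep
    elif random.randint(1, n) <= k:                # keep with probability k/n
        reservoir[random.randrange(k)] = item      # evict a uniformly random old item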
This is called reservoir sampling. Source: http://en.wikipedia.org/wiki/Reservoir_sampling. The basic algorithm from that article (Algorithm R), written out here as a runnable Python function:
import random

def reservoir_sample(S, k):
    # R is the result: a uniform random sample of (up to) k items from S.
    R = []
    for i, item in enumerate(S, start=1):
        if i <= k:
            # fill the reservoir with the first k items
            R.append(item)
        else:
            # replace elements with gradually decreasing probability k/i
            j = random.randint(1, i)  # important: inclusive range
            if j <= k:
                R[j - 1] = item       # pseudocode is 1-indexed, Python is 0-indexed
    return R
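For example, you can feed it a generator, so the full stream never has to be held in memory (event_stream() below is a made-up stand-in for your real event source):

def event_stream():
    for event_id in range(1000000):               # stand-in for the real events
        yield {"id": event_id}

sample = reservoir_sample(event_stream(), 1000)   # uniform sample of up to 1000 events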
A decent explanation/proof: http://propersubset.com/2010/04/choosing-random-elements.html
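In short (a sketch of the standard induction argument, not taken from the link above): after the first k items, each is in the reservoir with probability 1 = k/k. When item i arrives, it is kept with probability k/i; an item already stored is evicted with probability (k/i)*(1/k) = 1/i, so it survives with probability (i-1)/i, and its overall probability goes from k/(i-1) to k/(i-1) * (i-1)/i = k/i. So after n items, every item has been retained with probability exactly k/n.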