How to keep a random subset of a stream of data?

I have a stream of events flowing through my servers. It is not feasible for me to store all of them, but I would like to periodically be able to process some of them in aggregate. So, I want to keep a subset of the stream that is a random sampling of everything I've seen, but is capped to a max size.

So, for each new item, I need an algorithm to decide if I should add it to the stored set, or if I should discard it. If I add it, and I'm already at my limit, I need an algorithm to evict one of the old items.

Obviously, this is easy as long as I'm below my limit (just save everything). But how can I maintain a good random sampling without being biased towards old items or new items once I'm past that limit?

Thanks,

asked Dec 02 '10 by twk

2 Answers

This is a common interview question.

One easy way to do it is to keep the nth element with probability k/n (this is 1 for the first k elements, so you simply store those). If you keep a new element and the reservoir is already full, evict one of the stored elements uniformly at random.

This gives you a uniformly random subset of the elements seen so far. Note that n is just a running count of the items processed, so you never need to know the total length of the stream in advance.
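
A minimal sketch of this scheme in Python, processing one item at a time (the class and method names are my own invention, not part of the answer):

import random

class ReservoirSampler:
    def __init__(self, k):
        self.k = k          # max size of the stored subset
        self.sample = []    # the stored subset (the "reservoir")
        self.n = 0          # running count of items seen so far

    def offer(self, item):
        self.n += 1
        if len(self.sample) < self.k:
            self.sample.append(item)   # below the cap: keep everything
        elif random.random() < self.k / self.n:
            # keep the nth item with probability k/n, evicting a
            # uniformly random stored item to make room
            self.sample[random.randrange(self.k)] = item

Create one sampler with ReservoirSampler(100), call offer(event) on every incoming event, and sample is at all times a uniform random subset (capped at 100) of everything seen so far.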

answered Sep 18 '22 by jonderry

This is called reservoir sampling. Source: http://en.wikipedia.org/wiki/Reservoir_sampling

import random

def reservoir_sample(S, k):
    # fill the reservoir with the first k elements
    R = S[:k]

    # replace elements with gradually decreasing probability
    for i in range(k, len(S)):
        j = random.randint(0, i)   # important: inclusive range 0..i
        if j < k:
            R[j] = S[i]
    return R
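
A quick usage example, assuming the rendering above (reservoir_sample is my translation of the cited pseudocode into runnable Python, not the answer's original):

events = list(range(1000000))
subset = reservoir_sample(events, 100)   # 100 elements, each equally likely

Note this version takes the whole sequence S up front; for a live stream you would instead track the running index i as events arrive, as in the first answer.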

A decent explanation/proof: http://propersubset.com/2010/04/choosing-random-elements.html

answered Sep 20 '22 by Chris Dunder