Is there an algorithm for how to perform reservoir sampling when the points in the data stream have associated weights?
The algorithm by Pavlos Efraimidis and Paul Spirakis solves exactly this problem. The original paper with complete proofs is published with the title "Weighted random sampling with a reservoir" in Information Processing Letters 2006, but you can find a simple summary here.
The algorithm works as follows. First observe that another way to solve the unweighted reservoir sampling is to assign to each element a random id R between 0 and 1 and incrementally (say with a heap) keep track of the top k ids. Now let's look at weighted version, and let's say the i-th element has weight w_i. Then, we modify the algorithm by choosing the id of the i-th element to be R^(1/w_i) where R is again uniformly distributed in (0,1).
Another article talking about this algorithm is this one by the Cloudera folks.
You can try the A-ES algorithm from this paper of S. Efraimidis. It's quite simple to code and very efficient.
Hope this helps,
Benoit
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With