Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there an algorithm for weighted reservoir sampling? [closed]

Is there an algorithm for how to perform reservoir sampling when the points in the data stream have associated weights?

like image 229
Budhapest Avatar asked Jun 14 '13 21:06

Budhapest


2 Answers

The algorithm by Pavlos Efraimidis and Paul Spirakis solves exactly this problem. The original paper with complete proofs is published with the title "Weighted random sampling with a reservoir" in Information Processing Letters 2006, but you can find a simple summary here.

The algorithm works as follows. First observe that another way to solve the unweighted reservoir sampling is to assign to each element a random id R between 0 and 1 and incrementally (say with a heap) keep track of the top k ids. Now let's look at weighted version, and let's say the i-th element has weight w_i. Then, we modify the algorithm by choosing the id of the i-th element to be R^(1/w_i) where R is again uniformly distributed in (0,1).

Another article talking about this algorithm is this one by the Cloudera folks.

like image 174
had00b Avatar answered Sep 18 '22 17:09

had00b


You can try the A-ES algorithm from this paper of S. Efraimidis. It's quite simple to code and very efficient.

Hope this helps,

Benoit

like image 32
bmat06 Avatar answered Sep 18 '22 17:09

bmat06