I have this problem
You are given a set S of n numbers. You must pick a subset S′ of k numbers from S such that the probability of each element of S occurring in S′ is equal (i.e., each is selected with probability k/n). You may make only one pass over the numbers. What if n is unknown?
and I even have a solution: http://www.algorithm.cs.sunysb.edu/algowiki/index.php/TADM2E_2.43
Still: I don't understand the problem text at all.
I'm given a set S of n numbers. Fine. I need to pick a subset (2^n subsets are possible) of k numbers such that the probability of each element of S occurring in S' is equal... the obvious answer to me would be to just grab the empty S' set: each element in S would have 0 probability to be in S'.
If this is not acceptable (and it should have been stated), I suppose I should count for the most-recurring element (appearing T times) in S and make every other element have exactly T instances in S' (it should still be a subset if the elements are contained in S).
I don't really understand the priority queue solution, nor the k/n probability. Can somebody help me out with this?
This is a very famous problem with a resulting technique called Reservoir Sampling - a very useful algorithm for stream processing of big data. The preceding link can give you the precise setting, motivation, and explanation of the solution.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With