Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Programing Pearls - Random Select algorithm

Tags:

algorithm

Page 120 of Programming Pearls 1st edition presents this algorithm for selecting M equally probable random elements out of a population of N integers.

InitToEmpty
Size := 0
While Size < M do
  T := RandInt(1,N)
  if not Member(T)    
     Insert(T)
     Size := Size + 1

It is stated that the expected number of Member tests is less than 2M, as long as M < N/2.

I'd like to know how to prove it, but my algorithm analysis background is failing me.

I understand that the closer M is to N, the longer the program will take, because the result set will have more elements and the likelihood of RandInt selecting an existing one will increase proportionally.

Can you help me figuring out this proof?

like image 682
mpm Avatar asked Jul 19 '10 16:07

mpm


1 Answers

I am not a math wizard, but I will give it a rough shot. This is NOT guaranteed to be right though.

For each additional member of M, you pick a number, see if it's there, and if is add it. Otherwise, you try again. Trying something until you're successful is called a geometric probability distribution.

http://en.wikipedia.org/wiki/Geometric_distribution

So you are running M geometric trials. Each trial has expected value 1/p, so will take expected 1/p tries to get a number not already in M. p is N minus the number of numbers we've already added from M divided by N (i.e. how many unpicked items / total items). So for the fourth number, p = (N -3) / N, which is the probability of picking an unused number, so the expected number of picks for the third number is N / N-3 .

The expected value of the run time is all of these added together. So something like

E(run time) = N/N + N/(N -1) + N/(N -2 ) ... + N/ (N-M)

Now if M < N/2, then the last element in that summation is bounded above by 2. ((N/N/2) == 2)). It's also obviously the largest element in the whole summation. So if the biggest element is two picks, and there are M elements being summed, the EV of the whole run time is bounded above by 2M.

Ask me if any of this is unclear. Correct me if any of this is wrong :)

like image 133
Bill Prin Avatar answered Nov 15 '22 12:11

Bill Prin