Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Uniformly and randomly choose M elements from N elements - confused

I understand similar questions have been asked like

Choose m elements randomly from a vector containing n elements

Pick N items at random from sequence of unknown length

But the more I look the more I got confused.

Uniformly and randomly choose M elements from N elements

So I need to pick M elements from N ones. And also I need to make the probability of being selected to be uniformly distributed for every element: M/N

My intuition was

  1. Randomly choose an element
  2. Get it out
  3. Repeat the process on the rest of elements

I guess this solution is wrong? The probabilities for the M selected elements are 1/N, 1/(N-1), ..., 1/(N-M), not M/N, am I correct?


A possible correct solution is that

  1. Just do a shuffle on the whole N elements
  2. Pick the top M ones.

But I can't figure out the probability of being selected for each element.

Can anyone demonstrate the probability of this solution is M/N indeed?


Also of course we can use Reservoir sampling, and its probability is M / N.

like image 384
Jackson Tale Avatar asked Feb 11 '23 12:02

Jackson Tale


2 Answers

I think your confusion there may be that you're not discriminating between choosing a sequence, from choosing a set.

In your first procedure, just because you only have a 1/N chance of choosing a particular element in the first round, doesn't mean you won't choose it in a subsequent round. The element has a 1/N chance of being the first element in the result... but it has a M/N chance of being chosen during some round. So that works. Take M=2, N=4: The chance of an element being picked is 1/4 + (3/4)*(1/3) = 2/4.

As for your second procedure, following the shuffle, each element's position within the array is uniformly distributed, so there's a M/N chance that its position is equal to or less than M (and is hence chosen). So that works too.

like image 189
Sneftel Avatar answered May 12 '23 21:05

Sneftel


The probabilities for elements being picked:

First: 1/N;
Second: 1/(N-1) * (1 - 1/N) = 1/N
...

where (1 - 1/N) is the probability of the second item not being picked at the first step, which is the condition for it to be picked at the second step. Here 1/N is the probability of some element being selected at the first step, and (1 - 1/N) - probability of element selected at the second step being available for selection at the second step.

Third: 1/(N-2) * (1 - 1/N - 1/N) = 1/N

where (1 - 1/N - 1/N) is the probability of an element being available for selection at the third step.

And so on.

The point here is that for any element the probability of selection at a step is 1/N.

like image 31
Igor Avatar answered May 12 '23 22:05

Igor