Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Select n records at random from a set of N

I need to select n records at random from a set of N (where 0 < n < N).

A possible algorithm is:

Iterate through the list and for each element, make the probability of selection = (number needed) / (number left)

So if you had 40 items, the first would have a 5/40 chance of being selected.

If it is, the next has a 4/39 chance, otherwise it has a 5/39 chance. By the time you get to the end you will have your 5 items, and often you'll have all of them before that.

Assuming a good pseudo-random number generator, is this algorithm correct?


NOTE

There're many questions of this kind on stackoverflow (a lot of them are marked as duplicates of Select N random elements from a List<T> in C#).

The above algorithm is often proposed (e.g. Kyle Cronin's answer) and it's always questioned (e.g. see here, here, here, here...).

Can I have a final word about the matter?

like image 590
manlio Avatar asked Jan 28 '16 15:01

manlio


1 Answers

The algorithm is absolutely correct.

It's not the sudden invention of a good poster, it's a well known technique called Selection sampling / Algorithm S (discovered by Fan, Muller and Rezucha (1) and independently by Jones (2) in 1962), well described in TAOCP - Volume 2 - Seminumerical Algorithms - § 3.4.2.

As Knuth says:

This algorithm may appear to be unreliable at first glance and, in fact, to be incorrect. But a careful analysis shows that it is completely trustworthy.

The algorithm samples n elements from a set of size N and the t + 1st element is chosen with probability (n - m) / (N - t), when already m elements have been chosen.

It's easy to see that we never run off the end of the set before choosing n items (as the probability will be 1 when we have k elements to choose from the remaining k elements).

Also we never pick too many elements (the probability will be 0 as soon n == m).

It's a bit harder to demonstrate that the sample is completely unbiased, but it's

... true in spite of the fact that we are not selecting the t + 1st item with probability n / N. This has caused some confusion in the published literature

(so not just on Stackoverflow!)

The fact is we should not confuse conditional and unconditional probabilities:

For example consider the second element; if the first element was selected in the sample (this happen with probability n / N), the second element is selected with probability (n - 1) / (N - 1); if the first element was not selected, the second element is selected with probability n / (N - 1).

The overall probability of selecting the second element is (n / N) ((n - 1) / (N - 1)) + (1 - n/N)(n / (N - 1)) = n/N.

TAOCP - Vol 2 - Section 3.4.2 exercise 3

Apart from theoretical considerations, Algorithm S (and algorithm R / reservoir sampling) is used in many well known libraries (e.g. SGI's original STL implementation, std::experimental::sample, random.sample in Python...).

Of course algorithm S is not always the best answer:

  • it's O(N) (even if we will usually not have to pass over all N records: the average number of records considered when n=2 is about 2/3 N; the general formulas are given in TAOCP - Vol 2 - § 3.4.2 - ex 5/6);
  • it cannot be used when the value of N isn't known in advance.

Anyway it works!


  1. C. T. Fan, M. E. Muller and I. Rezucha, J. Amer. Stat. Assoc. 57 (1962), pp 387 - 402
  2. T. G. Jones, CACM 5 (1962), pp 343

EDIT

how do you randomly select this item, with a probability of 7/22

[CUT]

In rare cases, you might even pick 4 or 6 elements when you wanted 5

This is from N3925 (small modifications to avoid the common interface / tag dispatch):

template<class PopIter, class SampleIter, class Size, class URNG>
SampleIter sample(PopIter first, PopIter last, SampleIter out, Size n, URNG &&g)
{
  using dist_t = uniform_int_distribution<Size>;
  using param_t = typename dist_t::param_type;

  dist_t d{};

  Size unsampled_sz = distance(first, last);
  for (n = min(n, unsampled_sz); n != 0;  ++first)
  {
    param_t const p{0, --unsampled_sz};

    if (d(g, p) < n) { *out++ = *first; --n; }
  }

  return out;
}

There aren't floats.

  • If you need 5 elements you get 5 elements;
  • if uniform_int_distribution "works as advertised" there is no bias.
like image 98
manlio Avatar answered Sep 21 '22 01:09

manlio