I need to select n
records at random from a set of N
(where 0 < n < N
).
A possible algorithm is:
Iterate through the list and for each element, make the probability of selection =
(number needed) / (number left)
So if you had 40 items, the first would have a
5/40
chance of being selected.If it is, the next has a
4/39
chance, otherwise it has a5/39
chance. By the time you get to the end you will have your 5 items, and often you'll have all of them before that.
Assuming a good pseudo-random number generator, is this algorithm correct?
NOTE
There're many questions of this kind on stackoverflow (a lot of them are marked as duplicates of Select N random elements from a List<T> in C#).
The above algorithm is often proposed (e.g. Kyle Cronin's answer) and it's always questioned (e.g. see here, here, here, here...).
Can I have a final word about the matter?
The algorithm is absolutely correct.
It's not the sudden invention of a good poster, it's a well known technique called Selection sampling / Algorithm S (discovered by Fan, Muller and Rezucha (1) and independently by Jones (2) in 1962), well described in TAOCP - Volume 2 - Seminumerical Algorithms - § 3.4.2.
As Knuth says:
This algorithm may appear to be unreliable at first glance and, in fact, to be incorrect. But a careful analysis shows that it is completely trustworthy.
The algorithm samples n
elements from a set of size N
and the t + 1
st element is chosen with probability (n - m) / (N - t)
, when already m
elements have been chosen.
It's easy to see that we never run off the end of the set before choosing n
items (as the probability will be 1
when we have k
elements to choose from the remaining k
elements).
Also we never pick too many elements (the probability will be 0
as soon n == m
).
It's a bit harder to demonstrate that the sample is completely unbiased, but it's
... true in spite of the fact that we are not selecting the
t + 1
st item with probabilityn / N
. This has caused some confusion in the published literature
(so not just on Stackoverflow!)
The fact is we should not confuse conditional and unconditional probabilities:
For example consider the second element; if the first element was selected in the sample (this happen with probability
n / N
), the second element is selected with probability(n - 1) / (N - 1)
; if the first element was not selected, the second element is selected with probabilityn / (N - 1)
.The overall probability of selecting the second element is
(n / N) ((n - 1) / (N - 1)) + (1 - n/N)(n / (N - 1)) = n/N
.
TAOCP - Vol 2 - Section 3.4.2 exercise 3
Apart from theoretical considerations, Algorithm S (and algorithm R / reservoir sampling) is used in many well known libraries (e.g. SGI's original STL implementation, std::experimental::sample
,
random.sample
in Python...).
Of course algorithm S is not always the best answer:
O(N)
(even if we will usually not have to pass over all N
records: the average number of records considered when n=2
is about 2/3 N
; the general formulas are given in
TAOCP - Vol 2 - § 3.4.2 - ex 5/6);N
isn't known in advance.Anyway it works!
EDIT
how do you randomly select this item, with a probability of 7/22
[CUT]
In rare cases, you might even pick 4 or 6 elements when you wanted 5
This is from N3925 (small modifications to avoid the common interface / tag dispatch):
template<class PopIter, class SampleIter, class Size, class URNG>
SampleIter sample(PopIter first, PopIter last, SampleIter out, Size n, URNG &&g)
{
using dist_t = uniform_int_distribution<Size>;
using param_t = typename dist_t::param_type;
dist_t d{};
Size unsampled_sz = distance(first, last);
for (n = min(n, unsampled_sz); n != 0; ++first)
{
param_t const p{0, --unsampled_sz};
if (d(g, p) < n) { *out++ = *first; --n; }
}
return out;
}
There aren't floats.
uniform_int_distribution
"works as advertised" there is no bias.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With