Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Most efficient way of randomly choosing a set of distinct integers

I'm looking for the most efficient algorithm to randomly choose a set of n distinct integers, where all the integers are in some range [0..maxValue].

Constraints:

  • maxValue is larger than n, and possibly much larger
  • I don't care if the output list is sorted or not
  • all integers must be chosen with equal probability

My initial idea was to construct a list of the integers [0..maxValue] then extract n elements at random without replacement. But that seems quite inefficient, especially if maxValue is large.

Any better solutions?

like image 884
mikera Avatar asked Sep 15 '10 22:09

mikera


3 Answers

For small values of maxValue such that it is reasonable to generate an array of all the integers in memory then you can use a variation of the Fisher-Yates shuffle except only performing the first n steps.


If n is much smaller than maxValue and you don't wish to generate the entire array then you can use this algorithm:

  1. Keep a sorted list l of number picked so far, initially empty.
  2. Pick a random number x between 0 and maxValue - (elements in l)
  3. For each number in l if it smaller than or equal to x, add 1 to x
  4. Add the adjusted value of x into the sorted list and repeat.

If n is very close to maxValue then you can randomly pick the elements that aren't in the result and then find the complement of that set.


Here is another algorithm that is simpler but has potentially unbounded execution time:

  1. Keep a set s of element picked so far, initially empty.
  2. Pick a number at random between 0 and maxValue.
  3. If the number is not in s, add it to s.
  4. Go back to step 2 until s has n elements.

In practice if n is small and maxValue is large this will be good enough for most purposes.

like image 124
Mark Byers Avatar answered Nov 02 '22 15:11

Mark Byers


Here is an optimal algorithm, assuming that we are allowed to use hashmaps. It runs in O(n) time and space (and not O(maxValue) time, which is too expensive).

It is based on Floyd's random sample algorithm. See my blog post about it for details. The code is in Java:

private static Random rnd = new Random();

public static Set<Integer> randomSample(int max, int n) {
    HashSet<Integer> res = new HashSet<Integer>(n);
    int count = max + 1;
    for (int i = count - n; i < count; i++) {
        Integer item = rnd.nextInt(i + 1);
        if (res.contains(item))
            res.add(i);
        else
            res.add(item);
    }
    return res;
}
like image 13
Eyal Schneider Avatar answered Nov 02 '22 14:11

Eyal Schneider


One way to do it without generating the full array.

Say I want a randomly selected subset of m items from a set {x1, ..., xn} where m <= n.

Consider element x1. I add x1 to my subset with probability m/n.

  • If I do add x1 to my subset then I reduce my problem to selecting (m - 1) items from {x2, ..., xn}.
  • If I don't add x1 to my subset then I reduce my problem to selecting m items from {x2, ..., xn}.

Lather, rinse, and repeat until m = 0.

This algorithm is O(n) where n is the number of items I have to consider.

I rather imagine there is an O(m) algorithm where at each step you consider how many elements to remove from the "front" of the set of possibilities, but I haven't convinced myself of a good solution and I have to do some work now!

like image 2
Rafe Avatar answered Nov 02 '22 14:11

Rafe