I have an external collection containing n elements that I want to select some number (k) of them at random, outputting the indices of those elements to some serialized data file. I want the indices to be output in strict ascending order, and for there to be no duplicates. Both n and k may be quite large, and it is generally not feasible to simply store entire arrays in memory of that size.
The first algorithm I came up with was to pick a random number r[0] from 1 to n-k... and then pick a successive random numbers r[i] from r[i-1]+1 to n-k+i, only needing to store two entries for 'r' at any one time. However, a fairly simple analysis reveals the the probability for selecting small numbers is inconsistent with what could have been if the entire set was equally distributed. For example, if n was a billion and k was half a billion, the probability of selecting the first entry with the approach I've just described is very tiny (1 in half a billion), where in actuality since half of the entries are being selected, the first should be selected 50% of the time. Even if I use external sorting to sort k random numbers, I would have to discard any duplicates, and try again. As k approaches n, the number of retries would continue to grow, with no guarantee of termination.
I would like to find a O(k) or O(k log k) algorithm to do this, if it is at all possible. The implementation language I will be using is C++11, but descriptions in pseudocode may still be helpful.
Use a random. randrange() function to get a random integer number from the given exclusive range by specifying the increment. For example, random. randrange(0, 10, 2) will return any random number between 0 and 20 (like 0, 2, 4, 6, 8).
The randint() method to generates a whole number (integer). You can use randint(0,50) to generate a random number between 0 and 50. To generate random integers between 0 and 9, you can use the function randrange(min,max) .
If in practice k has the same order of magnitude as n, perhaps very straightforward O(n) algorithm will suffice:
assert(k <= n);
std::uniform_real_distribution rnd;
for (int i = 0; i < n; i++) {
if (rnd(engine) * (n - i) < k) {
std::cout << i << std::endl;
k--;
}
}
It produces all ascending sequences with equal probability.
You can solve this recursively in O(k log k) if you partition in the middle of your range, and randomly sample from the hypergeometric probability distribution to choose how many values lie above and below the middle point (i.e. the values of k for each subsequence), then recurse for each:
int sample_hypergeometric(int n, int K, int N) // samples hypergeometric distribution and
// returns number of "successes" where there are n draws without replacement from
// a population of N with K possible successes.
// Something similar to scipy.stats.hypergeom.rvs in Python.
// In this case, "success" means the selected value lying below the midpoint.
{
std::default_random_engine generator;
std::uniform_real_distribution<double> distribution(0.0,1.0);
int successes = 0;
for(int trial = 0; trial < n; trial++)
{
if((int)(distribution(generator) * N) < K)
{
successes++;
K--;
}
N--;
}
return successes;
}
select_k_from_n(int start, int k, int n)
{
if(k == 0)
return;
if(k == 1)
{
output start + random(1 to n);
return;
}
// find the number of results below the mid-point:
int k1 = sample_hypergeometric(k, n >> 1, n);
select_k_from_n(start, k1, n >> 1);
select_k_from_n(start + (n >> 1), k - k1, n - (n >> 1));
}
Sampling from the binomial distribution could also be used to approximate the hypergeometric distribution with p = (n >> 1) / n, rejecting samples where k1 > (n >> 1).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With