
How to generate a list of ascending random integers

I have an external collection of n elements from which I want to select some number (k) at random, writing the indices of the selected elements to a serialized data file. The indices must be output in strictly ascending order, with no duplicates. Both n and k may be quite large, and it is generally not feasible to hold arrays of that size in memory.

The first algorithm I came up with was to pick a random number r[0] from 1 to n-k, and then pick each successive random number r[i] from r[i-1]+1 to n-k+i, so that only two entries of r need to be stored at any one time. However, a fairly simple analysis reveals that the probability of selecting small numbers is inconsistent with a uniform selection over the entire set. For example, if n were a billion and k were half a billion, the probability of selecting the first entry with this approach would be tiny (1 in half a billion), whereas it should be selected 50% of the time, since half of the entries are being selected. The alternative of using external sorting on k random numbers would require discarding any duplicates and trying again. As k approaches n, the number of retries would continue to grow, with no guarantee of termination.
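For small n and k the bias can be checked empirically. The following is a minimal sketch, assuming two hypothetical helpers (`biased_ascending` and `frequency_of_first`) that are only illustrations of the scheme described above; unlike the real use case, it keeps each sample in a vector because n is tiny here.

```cpp
#include <random>
#include <vector>

// Hypothetical helper reproducing the scheme described above: r[0] is uniform
// on 1..n-k, and each later r[i] is uniform on r[i-1]+1 .. n-k+i.
// Assumes 0 < k < n.
template <class Engine>
std::vector<int> biased_ascending(int n, int k, Engine& engine)
{
    std::vector<int> r;
    int prev = 0;
    for (int i = 0; i < k; ++i)
    {
        std::uniform_int_distribution<int> pick(prev + 1, n - k + i);
        prev = pick(engine);
        r.push_back(prev);
    }
    return r;
}

// Hypothetical helper: fraction of trials in which index 1 is selected.
template <class Engine>
double frequency_of_first(int n, int k, int trials, Engine& engine)
{
    int hits = 0;
    for (int t = 0; t < trials; ++t)
        if (biased_ascending(n, k, engine).front() == 1)
            ++hits;
    return static_cast<double>(hits) / trials;
}
```

With n = 10 and k = 5, `frequency_of_first` should come out near 1/(n-k) = 0.2, whereas a uniform selection would include index 1 with probability k/n = 0.5.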

I would like to find an O(k) or O(k log k) algorithm to do this, if it is at all possible. The implementation language I will be using is C++11, but descriptions in pseudocode may still be helpful.

markt1964 asked Jun 07 '16 22:06




2 Answers

If in practice k has the same order of magnitude as n, perhaps a very straightforward O(n) algorithm will suffice:

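// Requires <cassert>, <iostream> and <random>.
assert(k <= n);
std::mt19937_64 engine{std::random_device{}()};  // any standard engine will do
std::uniform_real_distribution<double> rnd(0.0, 1.0);
for (int i = 0; i < n; i++) {
    // Select index i with probability (values still needed) / (indices remaining).
    if (rnd(engine) * (n - i) < k) {
        std::cout << i << '\n';
        k--;
    }
}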

It produces all ascending sequences with equal probability.
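One way to check this claim (a sketch of the reasoning, not part of the original answer): at each step the loop selects index i with probability k'/(n - i) and rejects it with probability (n - i - k')/(n - i), where k' is the number of values still needed. For any fixed subset, the numerators of the k selection steps multiply to k!, the numerators of the n - k rejection steps multiply to (n - k)!, and the denominators run through n, n - 1, ..., 1:

```latex
P(\{i_1 < i_2 < \dots < i_k\}) = \frac{k!\,(n-k)!}{n!} = \binom{n}{k}^{-1}
```

Every k-element subset therefore has the same probability, and each individual index is selected with probability k/n.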

Vlad Shcherbina answered Sep 26 '22 01:09


You can solve this recursively in O(k log k) if you partition in the middle of your range, and randomly sample from the hypergeometric probability distribution to choose how many values lie above and below the middle point (i.e. the values of k for each subsequence), then recurse for each:

#include <iostream>
#include <random>

// One engine shared by every call. A default-seeded engine constructed inside
// the function would replay the same random sequence on each call.
static std::mt19937_64 generator{std::random_device{}()};

// Samples the hypergeometric distribution: returns the number of "successes"
// in n draws without replacement from a population of N that contains K
// possible successes (similar to scipy.stats.hypergeom.rvs in Python).
// In this case, "success" means the selected value lies below the midpoint.
int sample_hypergeometric(int n, int K, int N)
{
    std::uniform_real_distribution<double> distribution(0.0, 1.0);

    int successes = 0;
    for (int trial = 0; trial < n; trial++)
    {
        // The next draw is a success with probability K / N.
        if ((int)(distribution(generator) * N) < K)
        {
            successes++;
            K--;
        }
        N--;
    }
    return successes;
}

// Outputs k distinct values from start+1 .. start+n in ascending order.
void select_k_from_n(int start, int k, int n)
{
    if (k == 0)
        return;
    if (k == 1)
    {
        // A single value, uniform over the 1-based range 1..n.
        std::cout << start + std::uniform_int_distribution<int>(1, n)(generator) << '\n';
        return;
    }

    // Find the number of results below the midpoint:
    int k1 = sample_hypergeometric(k, n >> 1, n);
    select_k_from_n(start, k1, n >> 1);
    select_k_from_n(start + (n >> 1), k - k1, n - (n >> 1));
}

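Sampling from the binomial distribution with p = (n >> 1) / n (computed as a real-valued division) could also be used to approximate the hypergeometric distribution, rejecting samples where k1 > (n >> 1) or where k - k1 > n - (n >> 1), since neither half can supply more values than it contains.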
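The binomial variant might look like the following minimal sketch. The helper name `approx_lower_count` is hypothetical, and because the binomial draw only approximates the hypergeometric distribution, the resulting selection is not exactly uniform.

```cpp
#include <random>

// Hypothetical helper: approximates how many of k selected values fall in the
// lower half of a range of n elements, using a binomial draw with rejection.
// Assumes 0 <= k <= n and n >= 1.
template <class Engine>
int approx_lower_count(int k, int n, Engine& engine)
{
    const int lower = n >> 1;      // size of the lower half
    const int upper = n - lower;   // size of the upper half
    std::binomial_distribution<int> dist(k, static_cast<double>(lower) / n);
    for (;;)
    {
        const int k1 = dist(engine);
        // Reject splits that would not fit in one of the two halves.
        if (k1 <= lower && k - k1 <= upper)
            return k1;
    }
}
```

The returned value would stand in for the hypergeometric sample when choosing k1. The rejection loop retries more often as k approaches n, because fewer splits fit in both halves.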

samgak answered Sep 24 '22 01:09