How can I generate sorted uniformly distributed random numbers efficiently in C++?

I'd like to generate a large number n (e.g., n >= 1,000,000,000) of sorted, uniformly distributed random numbers in C++.

A first and simple approach I considered was to

  1. sequentially generate n uniformly distributed numbers using an std::uniform_real_distribution<double>,
  2. and then sort them using std::sort.

However, this takes several minutes.

A second and more sophisticated approach was to parallelize the two steps, as in:

#include <algorithm>
#include <execution>
#include <limits>
#include <random>
#include <vector>
#include <omp.h>

template <typename T>
void computeUniformDistribution(std::vector<T>& elements)
{
    #pragma omp parallel
    {
        // distribution_seed is defined elsewhere
        std::seed_seq seed{distribution_seed, static_cast<size_t>(omp_get_thread_num())};
        std::mt19937 prng(seed);
        std::uniform_real_distribution<T> uniform_dist(0, std::numeric_limits<T>::max());

        #pragma omp for
        for (size_t i = 0; i < elements.size(); ++i)
        {
            elements[i] = uniform_dist(prng);
        }
    }

    std::sort(std::execution::par_unseq, elements.begin(), elements.end());
}

However, even this takes about 30 seconds. Given that generating the uniformly distributed numbers takes only about 1.5 seconds, the bottleneck is still the sort phase.

Hence, I'd like to ask the following question: How can I efficiently generate uniformly distributed data in a sorted way?

asked Aug 15 '20 by epic-skyrise-tm


4 Answers

There are ways to generate samples that are already sorted, but I think that it might be better to generate partially sorted samples.

Divide the output range into k buckets of equal width. The number of samples in each bucket will have a multinomial distribution with equal probabilities. The slow way to sample the multinomial distribution is to generate n integers in [0, k) and count how many fall in each bucket. A more efficient method is to draw k Poisson samples with rate n/k conditioned on their sum not exceeding n, then add another n - sum samples using the slow way. Sampling the Poisson distribution is tricky to do perfectly, but when n/k is very large (as it will be here), the Poisson distribution is excellently approximated by rounding a normal distribution with mean and variance n/k. If that's unacceptable, the slow method does parallelize well.

Given the bucket counts, compute the prefix sums to find the bucket boundaries. For each bucket in parallel, generate the given number of samples within the bucketed range and sort them. If we choose n/k well, each bucket will almost certainly fit in L1 cache. For n = 1e9, I think I'd try k = 1e5 or k = 1e6.

Here's a sequential implementation. It's a little unpolished, since we really need to avoid 2x oversampling the bucket boundaries, which are closed, but I'll leave that to you. I'm not familiar with OMP, but I think you can get a pretty good parallel implementation by adding a pragma to the for loop at the end of SortedUniformSamples (a sketch of that follows the listing).

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <random>
#include <span>
#include <vector>

template <typename Dist, typename Gen>
void SortedSamples(std::span<double> samples, Dist dist, Gen& gen) {
  for (double& sample : samples) {
    sample = dist(gen);
  }
  std::sort(samples.begin(), samples.end());
}

template <typename Gen>
void ApproxMultinomialSample(std::span<std::size_t> samples, std::size_t n,
                             Gen& gen) {
  double lambda = static_cast<double>(n) / samples.size();
  std::normal_distribution<double> approx_poisson{lambda, std::sqrt(lambda)};
  std::size_t sum;
  do {
    for (std::size_t& sample : samples) {
      sample = std::lrint(approx_poisson(gen));
    }
    sum = std::accumulate(samples.begin(), samples.end(), std::size_t{0});
  } while (sum > n);
  std::uniform_int_distribution<std::size_t> uniform{0, samples.size() - 1};
  for (; sum < n; sum++) {
    samples[uniform(gen)]++;
  }
}

template <typename Gen>
void SortedUniformSamples(std::span<double> samples, Gen& gen) {
  static constexpr std::size_t kTargetBucketSize = 1024;
  if (samples.size() < kTargetBucketSize) {
    SortedSamples(samples, std::uniform_real_distribution<double>{0, 1}, gen);
    return;
  }
  std::size_t num_buckets = samples.size() / kTargetBucketSize;
  std::vector<std::size_t> bucket_counts(num_buckets);
  ApproxMultinomialSample(bucket_counts, samples.size(), gen);
  std::vector<std::size_t> prefix_sums(num_buckets + 1);
  std::partial_sum(bucket_counts.begin(), bucket_counts.end(),
                   prefix_sums.begin() + 1);
  for (std::size_t i = 0; i < num_buckets; i++) {
    SortedSamples(std::span<double>{&samples[prefix_sums[i]],
                                    &samples[prefix_sums[i + 1]]},
                  std::uniform_real_distribution<double>{
                      static_cast<double>(i) / num_buckets,
                      static_cast<double>(i + 1) / num_buckets},
                  gen);
  }
}

int main() {
  std::vector<double> samples(100000000);
  std::default_random_engine gen;
  SortedUniformSamples(samples, gen);
  if (std::is_sorted(samples.begin(), samples.end())) {
    std::cout << "sorted\n";
  }
}
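
One way that suggested pragma might look (a sketch under assumptions, not part of the original answer): the final loop of SortedUniformSamples, rewritten so that each thread draws from its own generator, since sharing one gen across threads would be a data race. It assumes OpenMP is enabled and <omp.h> is included; the seeding scheme is illustrative only.

// Hypothetical drop-in replacement for the final loop in
// SortedUniformSamples. Each thread seeds a private PRNG from its
// thread number; the buckets are still filled and sorted independently.
#pragma omp parallel
{
  std::seed_seq seq{12345, omp_get_thread_num()};  // illustrative seeding
  std::mt19937_64 thread_gen(seq);
  #pragma omp for schedule(dynamic)
  for (std::size_t i = 0; i < num_buckets; i++) {
    SortedSamples(std::span<double>{&samples[prefix_sums[i]],
                                    &samples[prefix_sums[i + 1]]},
                  std::uniform_real_distribution<double>{
                      static_cast<double>(i) / num_buckets,
                      static_cast<double>(i + 1) / num_buckets},
                  thread_gen);
  }
}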

If your standard library has a high-quality implementation of poisson_distribution, you could also do this:

template <typename Gen>
void MultinomialSample(std::span<std::size_t> samples, std::size_t n,
                       Gen& gen) {
  double lambda = static_cast<double>(n) / samples.size();
  std::poisson_distribution<std::size_t> poisson{lambda};
  std::size_t sum;
  do {
    for (std::size_t& sample : samples) {
      sample = poisson(gen);
    }
    sum = std::accumulate(samples.begin(), samples.end(), std::size_t{0});
  } while (sum > n);
  std::uniform_int_distribution<std::size_t> uniform{0, samples.size() - 1};
  for (; sum < n; sum++) {
    samples[uniform(gen)]++;
  }
}
answered Oct 21 '22 by David Eisenstat


I'd be tempted to rely on the fact that the differences between consecutive elements of a sorted set of uniformly distributed variables are exponentially distributed. This can be exploited to run in O(N) time rather than O(N log N).

A quick implementation would do something like:

#include <random>
#include <vector>

template<typename T> void
computeSortedUniform2(std::vector<T>& elements)
{
    std::random_device rd;
    std::mt19937 prng(rd());

    std::exponential_distribution<T> dist(static_cast<T>(1));

    // n+1 exponentials in total: the partial sums S_1..S_n divided by
    // S_{n+1} are distributed as the order statistics of n uniforms
    T sum = static_cast<T>(0);

    for (auto& elem : elements) {
        elem = sum += dist(prng);
    }

    sum += dist(prng);

    for (auto& elem : elements) {
        elem /= sum;
    }
}

This example is simplified by assuming you want values in Uniform(0, 1), but it should be easy to generalise. Making this work with OMP isn't quite trivial, but shouldn't be too hard; a sketch of one approach follows.
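
To illustrate (my sketch under assumptions, not the answerer's code): each thread builds the running sum over its own chunk, an exclusive prefix sum over the chunk totals then supplies the offset for each chunk, and a final pass normalises. It assumes the OpenMP team size equals omp_get_max_threads(); the seeding is illustrative only.

#include <omp.h>
#include <random>
#include <vector>

template<typename T> void
computeSortedUniformParallel(std::vector<T>& elements)
{
    const int nthreads = omp_get_max_threads();
    std::vector<T> totals(nthreads + 1, T(0));   // totals[0] stays 0
    const size_t n = elements.size();

    #pragma omp parallel num_threads(nthreads)
    {
        const int t = omp_get_thread_num();
        std::seed_seq seq{42, t};                // illustrative seeding
        std::mt19937 prng(seq);
        std::exponential_distribution<T> dist(T(1));

        const size_t lo = n * t / nthreads;      // this thread's chunk
        const size_t hi = n * (t + 1) / nthreads;

        T sum = T(0);                            // running sum within chunk
        for (size_t i = lo; i < hi; ++i)
            elements[i] = sum += dist(prng);
        totals[t + 1] = sum;

        #pragma omp barrier
        #pragma omp single
        {
            for (int k = 1; k <= nthreads; ++k)  // prefix-sum chunk totals
                totals[k] += totals[k - 1];
            totals[nthreads] += dist(prng);      // the (n+1)-th exponential
        }                                        // implicit barrier here

        const T offset = totals[t];
        const T denom = totals[nthreads];
        for (size_t i = lo; i < hi; ++i)         // shift and normalise
            elements[i] = (elements[i] + offset) / denom;
    }
}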

If you care about the last ~50% of performance, there are some numeric tricks that might speed up generating random deviates (e.g. there are faster and better PRNGs than the MT) as well as converting them to doubles (but recent compilers might know about these tricks). A couple of references: Daniel Lemire's blog and Melissa O'Neill's PCG site.

I've just benchmarked this and discovered that clang's std::uniform_real_distribution and std::exponential_distribution are both very slow. numpy's ziggurat-based implementations are 8 times faster, such that I can generate 1e9 doubles in ~10 seconds using a single thread on my laptop (i.e. the std implementations take ~80 seconds) using the above algorithm. I've not tried OP's implementation on 1e9 elements, but with 1e8 elements mine is ~15 times faster.

answered Oct 21 '22 by Sam Mason


I ran some tests and radix sort was 4 to 6 times as fast as std::sort, depending on the system, but it requires a second vector; for 2^30 elements, each vector of doubles is 8 GB, 16 GB for the two, so you would probably need 32 GB of RAM.

A multi-threaded radix sort may help if the sort is not memory-bandwidth limited.

Example single-threaded code:

#include <algorithm>
#include <cstdint>
#include <ctime>
#include <iostream>
#include <random>
#include <vector>

clock_t ctTimeStart;            // clock values
clock_t ctTimeStop;

//  a is input array, b is working array
uint64_t * RadixSort(uint64_t * a, uint64_t *b, size_t count)
{
uint32_t mIndex[8][256] = {0};          // count / index matrix
uint32_t i,j,m,n;
uint64_t u;
    for(i = 0; i < count; i++){         // generate histograms
        u = a[i];
        for(j = 0; j < 8; j++){
            mIndex[j][(size_t)(u & 0xff)]++;
            u >>= 8;
        }
    }
    for(j = 0; j < 8; j++){             // convert to indices
        m = 0;
        for(i = 0; i < 256; i++){
            n = mIndex[j][i];
            mIndex[j][i] = m;
            m += n;
        }
    }
    for(j = 0; j < 8; j++){             // radix sort
        for(i = 0; i < count; i++){     //  sort by current LSB
            u = a[i];
            m = (size_t)(u>>(j<<3))&0xff;
            b[mIndex[j][m]++] = u;
        }
        std::swap(a, b);                //  swap ptrs
    }
    return(a);
}

#define COUNT (1024*1024*1024)

int main(int argc, char**argv)
{
    std::vector<double> v(COUNT);       // vctr to be generated
    std::vector<double> t(COUNT);       // temp vector
    std::random_device rd;
    std::mt19937 gen(rd());
//  std::uniform_real_distribution<> dis(0, std::numeric_limits<double>::max());
    std::uniform_real_distribution<> dis(0, COUNT);
    ctTimeStart = clock();
    for(size_t i = 0; i < v.size(); i++)
        v[i] = dis(gen);
    ctTimeStop = clock();
    std::cout << "# of ticks " << ctTimeStop - ctTimeStart << std::endl;
    ctTimeStart = clock();
//  std::sort(v.begin(), v.end());
    RadixSort((uint64_t *)&v[0], (uint64_t *)&t[0], COUNT);
    ctTimeStop = clock();
    std::cout << "# of ticks " << ctTimeStop - ctTimeStart << std::endl;
    return(0);
}

If sorting doubles (cast to 64-bit unsigned integers) that include negative values, you'll need to treat them as sign + magnitude 64-bit integers. C++ macros used to convert sign + magnitude (SM) to/from 64-bit unsigned integers (ULL):

// converting doubles to unsigned long long for radix sort or something similar
// note -0 converted to 0x7fffffffffffffff, +0 converted to 0x8000000000000000
// -0 is unlikely to be produced by a float operation

#define SM2ULL(x) ((x)^(((~(x) >> 63)-1) | 0x8000000000000000ull))
#define ULL2SM(x) ((x)^((( (x) >> 63)-1) | 0x8000000000000000ull))
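
For example (a sketch reusing v, t, RadixSort, and COUNT from the listing above, not part of the original answer): map each double to an order-preserving unsigned key, radix sort the keys, then map back.

// Hypothetical usage of the macros with the RadixSort above. With 8
// passes the pointers are swapped an even number of times, so the
// sorted result lands back in v's buffer.
uint64_t *keys = (uint64_t *)&v[0];
for (size_t i = 0; i < COUNT; i++)
    keys[i] = SM2ULL(keys[i]);
uint64_t *sorted = RadixSort(keys, (uint64_t *)&t[0], COUNT);
for (size_t i = 0; i < COUNT; i++)
    sorted[i] = ULL2SM(sorted[i]);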
answered Oct 21 '22 by rcgldr


There is a simple observation involving sorted uniform random numbers in [0, 1]:

  1. Each uniform [0, 1] number is equally likely to be less than half or greater than half. Thus, the number of uniform [0, 1] numbers that are less than half vs. greater than half follows a binomial(n, 1/2) distribution.
  2. Of the numbers less than half, each number is as likely to be less than 1/4 as it is to be greater than 1/4, so that the less-than-1/4 vs. greater-than-1/4 numbers follow the same distribution.
  3. And so on.

Thus, each number can be generated one bit at a time, from left to right after the binary point. Here is a sketch of how this works to generate n sorted uniform random numbers:

  1. If n is 0 or 1, stop. Otherwise, generate b, a binomial(n, 1/2) random number.
  2. Append a 0 bit to the first b random numbers and a 1 bit to the rest.
  3. Run this algorithm recursively on the first b numbers, but with n = b.
  4. Run this algorithm recursively on the rest of the numbers, but with n = n - b.

At this point, we have a sorted list of random numbers with varying bit counts. All that is left to do is to fill each number with uniform random bits as needed (or chop off or round excess bits) to give the number p bits (for example, 53 bits for double precision). Then, divide each number by 2^p.
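
A minimal sketch of this recursion (the names, the recursion cutoff, and the 53-bit choice are mine, not the answerer's):

#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Fix one more bit after the binary point for every number in
// [lo, lo + n): the first b numbers get a 0 bit, the rest a 1 bit.
template <typename Gen>
void SplitBits(std::vector<uint64_t>& nums, std::vector<int>& nbits,
               std::size_t lo, std::size_t n, int depth, Gen& gen) {
  if (n <= 1 || depth == 53) return;
  std::size_t b = std::binomial_distribution<std::size_t>(n, 0.5)(gen);
  for (std::size_t i = lo; i < lo + n; ++i) {
    nums[i] = (nums[i] << 1) | (i < lo + b ? 0 : 1);
    nbits[i] = depth + 1;
  }
  SplitBits(nums, nbits, lo, b, depth + 1, gen);
  SplitBits(nums, nbits, lo + b, n - b, depth + 1, gen);
}

// Pad each prefix with uniform random bits up to p = 53 bits, then
// scale by 2^-53 to get sorted doubles in [0, 1).
template <typename Gen>
std::vector<double> SortedUniform(std::size_t n, Gen& gen) {
  std::vector<uint64_t> nums(n, 0);
  std::vector<int> nbits(n, 0);
  SplitBits(nums, nbits, 0, n, 0, gen);
  std::vector<double> out(n);
  for (std::size_t i = 0; i < n; ++i) {
    int pad = 53 - nbits[i];
    std::uniform_int_distribution<uint64_t> fill(
        0, (uint64_t{1} << pad) - 1);
    out[i] = std::ldexp(static_cast<double>((nums[i] << pad) | fill(gen)),
                        -53);
  }
  return out;
}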

I give a similar algorithm to find the k-th smallest out of n random numbers.

answered Oct 21 '22 by Peter O.