
How to generate Zipf distributed numbers efficiently?

Tags: c++, random

I'm currently benchmarking some data structures in C++ and I want to test them when working on Zipf-distributed numbers.

I'm using the generator provided on this site: http://www.cse.usf.edu/~christen/tools/toolpage.html

I adapted the implementation to use a Mersenne Twister generator.

It works well, but it is really slow. In my case, the range can be large (about a million) and the number of random numbers to generate can be several million.

The alpha parameter is fixed; it does not change over time.

I tried precalculating all the sum_prob values. That is much faster, but it is still slow for a large range.

Is there a faster way to generate Zipf-distributed numbers? Even something less precise would be welcome.

Thanks

Baptiste Wicht asked Apr 02 '12 20:04


1 Answer

Pre-calculation alone does not help much. However, sum_probs is cumulative, so it is sorted in ascending order. If we use a binary search to find zipf_value, the cost of generating one Zipf-distributed number drops from O(n) to O(log(n)), which is a large improvement in efficiency.

Here it is; just replace the zipf() function in genzipf.c with the following one:

int zipf(double alpha, int n)
{
  static int first = TRUE;      // Static first time flag
  static double c = 0;          // Normalization constant
  static double *sum_probs;     // Pre-calculated sum of probabilities
  double z;                     // Uniform random number (0 < z < 1)
  int zipf_value;               // Computed Zipf value to be returned
  int    i;                     // Loop counter
  int low, high, mid;           // Binary-search bounds

  // Compute normalization constant on first call only
  if (first == TRUE)
  {
    for (i=1; i<=n; i++)
      c = c + (1.0 / pow((double) i, alpha));
    c = 1.0 / c;

    sum_probs = malloc((n+1)*sizeof(*sum_probs));
    sum_probs[0] = 0;
    for (i=1; i<=n; i++) {
      sum_probs[i] = sum_probs[i-1] + c / pow((double) i, alpha);
    }
    first = FALSE;
  }

  // Pull a uniform random number (0 < z < 1)
  do
  {
    z = rand_val(0);
  }
  while ((z == 0) || (z == 1));

  // Map z to the value
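  // Map z to the value: find the smallest index whose cumulative
  // probability is >= z (lower-bound binary search). The loop always
  // ends with low in [1, n], so zipf_value is never left unset.
  low = 1;
  high = n;
  while (low < high) {
    mid = low + (high - low) / 2;
    if (sum_probs[mid] >= z) {
      high = mid;
    } else {
      low = mid + 1;
    }
  }
  zipf_value = low;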

  // Assert that zipf_value is between 1 and N
  assert((zipf_value >=1) && (zipf_value <= n));

  return(zipf_value);
}
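
Since the question is about C++ and a Mersenne Twister generator, here is a minimal sketch of the same idea (a precomputed cumulative table plus a binary search) using the standard library. The class name ZipfSampler and its interface are hypothetical, not part of genzipf.c, and the table is built once in the constructor instead of on the first call. It assumes n >= 1 and that alpha stays fixed for the lifetime of the object.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not part of genzipf.c): samples Zipf-distributed
// integers in [1, n] by inverting a precomputed cumulative distribution.
class ZipfSampler {
public:
    ZipfSampler(double alpha, std::size_t n) : cdf_(n) {
        assert(n >= 1 && "n must be at least 1");
        // Unnormalized cumulative sums of 1 / k^alpha for k = 1..n.
        double sum = 0.0;
        for (std::size_t k = 1; k <= n; ++k) {
            sum += 1.0 / std::pow(static_cast<double>(k), alpha);
            cdf_[k - 1] = sum;
        }
        // Normalize so that the last entry is exactly 1.
        for (double& v : cdf_) v /= sum;
    }

    // Draws one value in [1, n]; O(log n) per call.
    template <class Rng>
    std::size_t operator()(Rng& rng) const {
        std::uniform_real_distribution<double> uniform(0.0, 1.0);
        const double z = uniform(rng);
        // First index whose cumulative probability is >= z.
        auto it = std::lower_bound(cdf_.begin(), cdf_.end(), z);
        if (it == cdf_.end()) --it;  // guard against rounding at the top end
        return static_cast<std::size_t>(it - cdf_.begin()) + 1;
    }

private:
    std::vector<double> cdf_;  // cdf_[k - 1] = P(X <= k)
};
```

Typical use would be to construct std::mt19937 gen(seed) and ZipfSampler zipf(alpha, n) once, then call zipf(gen) for every number needed. For a range of a million, the table takes about 8 MB of doubles, and each draw costs roughly 20 comparisons.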
Masoud Kazemi answered Sep 20 '22 13:09