I'm currently benchmarking some data structures in C++ and I want to test them when working on Zipf-distributed numbers.
I'm using the generator provided on this site: http://www.cse.usf.edu/~christen/tools/toolpage.html
I adapted the implementation to use a Mersenne Twister generator.
It works well but it is really slow. In my case, the range can be big (about a million) and the number of random numbers generate can be several millions.
The alpha parameter does not change over time, it is fixed.
I tried to precaculate all the sum_prob. It's much faster, but still slows on big range.
Is there a faster way to generate Zipf distributed numbers ? Even something less precise will be welcome.
Thanks
The pre-calculation alone does not help so much. But as it's obvious the sum_prob is accumulative and has ascending order. So if we use a binary-search to find the zipf_value we would decrease the order of generating a Zipf distributed number from O(n) to O(log(n)). Which is so much improvement in efficiency.
Here it is, just replace the zipf()
function in genzipf.c
with following one:
int zipf(double alpha, int n)
{
static int first = TRUE; // Static first time flag
static double c = 0; // Normalization constant
static double *sum_probs; // Pre-calculated sum of probabilities
double z; // Uniform random number (0 < z < 1)
int zipf_value; // Computed exponential value to be returned
int i; // Loop counter
int low, high, mid; // Binary-search bounds
// Compute normalization constant on first call only
if (first == TRUE)
{
for (i=1; i<=n; i++)
c = c + (1.0 / pow((double) i, alpha));
c = 1.0 / c;
sum_probs = malloc((n+1)*sizeof(*sum_probs));
sum_probs[0] = 0;
for (i=1; i<=n; i++) {
sum_probs[i] = sum_probs[i-1] + c / pow((double) i, alpha);
}
first = FALSE;
}
// Pull a uniform random number (0 < z < 1)
do
{
z = rand_val(0);
}
while ((z == 0) || (z == 1));
// Map z to the value
low = 1, high = n, mid;
do {
mid = floor((low+high)/2);
if (sum_probs[mid] >= z && sum_probs[mid-1] < z) {
zipf_value = mid;
break;
} else if (sum_probs[mid] >= z) {
high = mid-1;
} else {
low = mid+1;
}
} while (low <= high);
// Assert that zipf_value is between 1 and N
assert((zipf_value >=1) && (zipf_value <= n));
return(zipf_value);
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With