I got hold of a SUPER-FAST algorithm that generates an array of random bytes, uniformly. It's 6 times faster than the C++ uniform distribution and the Mersenne Twister of the standard library.
The length of the array is divisible by 4, so it can be interpreted as an array of integers. Casting each 4-byte entry to an integer produces values in the range [INT_MIN, INT_MAX]. But how can I transform these integer values to lie within my own [min, max]? I want to avoid any if-else, to avoid branching.
Maybe I should apply some bitwise logic to discard irrelevant bits in each number? (All remaining, unmasked bits will be either 0 or 1 anyway.) If I could extract the most significant bit of my maximum value, I could mask off any bits in my integers that are more significant than that one. For example, if I want my max to be 17, then it is 00010001 in binary form. Maybe my mask would then look like 00011111? I could then apply it to all numbers in my array. But this mask is wrong... it actually allows values up to 31 (1+2+4+8+16) :(
What can I do? Also, how do I take care of the min?
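A tiny sketch of the problem described above (the helper name is mine, used only for illustration):

```cpp
#include <cstdint>

// Masking with 0x1F (00011111) keeps the low five bits, so any input
// whose low bits happen to be set comes out as large as 31, which
// overshoots the desired maximum of 17.
uint32_t mask_to_max17(uint32_t x) {
    return x & 0x1Fu;
}
```

For example, `mask_to_max17(0xFFFFFFFFu)` yields 31, and `mask_to_max17(0x12u)` yields 18, both above 17.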
Edit
I am generating millions of numbers every frame of my application, for neural networks. I managed to vectorize the code using AVX2 for the float variants (using this post), but I need to get integers working too.
But how can I transform these integer values to lie between my own [min, maximum]?
Since the range may not be a power of two, bitmasking is out, but you found that out already.
Modulo is also out, it does not exist as a native operation in AVX2 (and even if it did, that wouldn't necessarily make it efficient).
There is another option: multiply-high, using _mm256_mul_epu32 (unfortunately there is no "pure" multiply-high for 32-bit numbers like there is for 16-bit numbers, so we're stuck with an operation that only does 50% useful work). The idea is to take the input number x (full range) and the desired range size r, then compute r * x / 2^32, where the division is implicit (implemented by taking the high half of the product). Interpreted as a rational number, x / 2^32 would lie in [0.0 .. 1.0) (excluding 1.0); multiplying by r then stretches the range to [0.0 .. r) (excluding r). That's not how it's calculated, but that's where the formula comes from. Setting the minimum of the range is handled easily by adding min to the scaled result.
In code (slightly tested):
__m256i squish(__m256i x, int min, int max) {
    __m256i sizeOfRange = _mm256_set1_epi32((unsigned)max - min);
    // High halves of the products of the even-indexed 32-bit lanes,
    // shuffled down into the even lanes.
    __m256i scaled_even = _mm256_shuffle_epi32(_mm256_mul_epu32(x, sizeOfRange), 0xB1);
    // High halves of the products of the odd-indexed lanes, which land
    // in the odd lanes already.
    __m256i scaled_odd = _mm256_mul_epu32(_mm256_shuffle_epi32(x, 0xB1), sizeOfRange);
    // Merge: even lanes from scaled_even, odd lanes from scaled_odd.
    __m256i scaled = _mm256_blend_epi32(scaled_even, scaled_odd, 0xAA);
    return _mm256_add_epi32(scaled, _mm256_set1_epi32(min));
}
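A scalar sketch of the same multiply-high idea, useful for checking one lane of the vectorized version (the function name is mine, not from the answer):

```cpp
#include <cstdint>

// Scalar equivalent of one lane of squish: scale a full-range 32-bit
// input into [min, max) by taking the high half of a 32x32 -> 64-bit
// product, i.e. computing r * x / 2^32.
int squish_scalar(uint32_t x, int min, int max) {
    uint32_t range = (uint32_t)max - (uint32_t)min;          // size of the output range
    uint32_t scaled = (uint32_t)(((uint64_t)x * range) >> 32); // high half = implicit / 2^32
    return (int)(scaled + (uint32_t)min);
}
```

For instance, with min = 0 and max = 17, the largest input 0xFFFFFFFF maps to 16, never 17, which shows the exclusive upper bound mentioned below.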
It's still an exclusive range; it cannot handle the full [INT_MIN .. INT_MAX] as an output range. There is no way to even specify it: the most it can do is [INT_MIN .. INT_MAX) (or, for example, an equivalent range with zero offset: [0 .. -1)).
It's also not really uniform, for the same reason that the simple modulo-based range reduction isn't really uniform: you just cannot fairly divide N marbles over K bins unless K happens to divide N evenly.
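To see the bias concretely, here is a toy sketch with small numbers (mine, not from the answer): distributing N = 8 equally likely inputs into K = 3 bins with the same scaling formula gives bins of unequal size.

```cpp
// Scale x in [0, 8) down to a bin in [0, 3) via bin = x * 3 / 8,
// the small-number analogue of r * x / 2^32.
int bin_of(int x) {
    return x * 3 / 8;
}

// Count how many of the 8 inputs land in a given bin.
int count_in_bin(int bin) {
    int count = 0;
    for (int x = 0; x < 8; ++x)
        if (bin_of(x) == bin) ++count;
    return count;
}
```

The bins come out as 3, 3, and 2 marbles: 3 does not divide 8, so one bin is necessarily short.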
The core idea is to use modulo instead of bitwise masks, which are useless in the non-power-of-2 case. "No branching" is also a bit of a weird requirement: what you want is "fast enough", not "no branching and bitwise masks".
So assume that we have a function int rand(); that produces a random integer uniformly. If max is of the form 2^n - 1, then rand() % (max+1) will uniformly produce a random integer in the range [0, max]. That's because max+1 is a power of 2 and therefore divides the generator's range evenly.
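A small-scale check of this claim (a toy sketch, not from the answer): over a 16-value range, each residue modulo 4 appears exactly 16 / 4 = 4 times.

```cpp
// Count how often residue r appears for x % 4 over a 16-value range.
// Since 4 divides 16 exactly, every residue class has the same size.
int residue_count_mod4(int r) {
    int count = 0;
    for (int x = 0; x < 16; ++x)
        if (x % 4 == r) ++count;
    return count;
}
```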
Now if min and max are such that max-min is of the form 2^n - 1, then (rand() % (max-min+1)) + min will uniformly produce a random integer in the range [min, max].
But what happens when max-min is not of the form 2^n - 1? Then we are out of luck. The (rand() % (max-min+1)) + min method will still produce a random integer in the [min, max] range, but no longer uniformly. Why is that? Because when n is fixed and not a power of 2, the number of inputs x that yield a given result r = x % n varies depending on r.
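Counting residues over a small full range makes the bias visible (a toy sketch of mine): folding the 256 byte values into 3 bins with % gives residue 0 one extra hit.

```cpp
// Count how often residue r appears for x % 3 over the 256 byte values.
// 3 does not divide 256, so the residues cannot all occur equally often.
int residue_count_mod3(int r) {
    int count = 0;
    for (int x = 0; x < 256; ++x)
        if (x % 3 == r) ++count;
    return count;
}
```

The counts come out as 86, 85, 85: residue 0 is slightly over-represented, which is exactly the non-uniformity described above.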
However, the method is not bad. The bigger the max-min value, the closer it gets to a uniform distribution, and often it is good enough in practice. And it is very fast, with no branching.
Another example is

upper = get_upper_power_of_2(max - min)
do
{
    tmp = rand() % upper;
} while (tmp > max - min);
result = tmp + min;
This method has the nice property that it is uniform, but it has no stop property, i.e. theoretically this algorithm may never stop. It also has branching. But in practice it does stop very quickly (with huge probability), and so it is quite a common algorithm. For example, it is used in the standard Java library.
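A self-contained version of the loop above, assuming `get_upper_power_of_2` returns the smallest power of two strictly greater than its argument (which matches the `tmp > max - min` rejection test); the function names follow the pseudocode, the rest is my sketch:

```cpp
#include <cstdint>
#include <random>

// Smallest power of two strictly greater than v.
static uint32_t get_upper_power_of_2(uint32_t v) {
    uint32_t p = 1;
    while (p <= v) p <<= 1;
    return p;
}

// Uniform integer in [min, max] by rejection sampling: draw uniformly
// from the next power-of-two range (which divides the generator's range
// evenly) and retry until the draw fits into [0, max - min].
int uniform_in_range(int min, int max, std::mt19937& rng) {
    uint32_t span = (uint32_t)max - (uint32_t)min; // wrap-safe max - min
    uint32_t upper = get_upper_power_of_2(span);
    uint32_t tmp;
    do {
        tmp = rng() % upper;  // uniform over [0, upper), upper is 2^k
    } while (tmp > span);
    return (int)((uint32_t)min + tmp);
}
```

Since `upper` is at most twice `span + 1`, each draw is accepted with probability greater than 1/2, so the expected number of iterations is below 2.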
Both methods, of course, have an issue when max-min overflows (i.e. when min is a big negative number), which can be fixed by switching to unsigned integers and then back to integers.
As far as I know, there is no algorithm that generates a random integer in [0, max], when max is not of the form 2^n - 1, from a uniform 0/1 bit generator such that the results are uniform and the algorithm has the stop property. I think that no such algorithm can exist, but I failed to find the appropriate result in computer science.