I am using a base-conversion algorithm to generate a permutation from a large integer (split into 32-bit words).
I use a relatively standard algorithm for this:
/* N = count,K is permutation index (0..N!-1) A[N] contains 0..N-1 */
i = 0;
while (N > 1) {
swap A[i] and A[i+(k%N)]
k = k / N
N = N - 1
i = i + 1
}
Unfortunately, the divide and modulo each iteration adds up, especially moving to large integers - But, it seems I could just use multiply!
/* As before, N is count, K is index, A[N] contains 0..N-1 */
/* Split is arbitrarily 128 (bits), for my current choice of N */
/* "Adjust" is precalculated: (1 << Split)/(N!) */
a = k*Adjust; /* a can be treated as a fixed point fraction */
i = 0;
while (N > 1) {
a = a*N;
index = a >> Split;
a = a & ((1 << Split) - 1); /* actually, just zeroing a register */
swap A[i] and A[i+index]
N = N - 1
i = i + 1
}
This is nicer, but doing large integer multiplies is still sluggish.
Question 1:
Is there a way of doing this faster?
Eg. Since I know that N*(N-1) is less than 2^32, could I pull out those numbers from one word, and merge in the 'leftovers'?
Or, is there a way to modify an arithetic decoder to pull out the indicies one at a time?
Question 2:
For the sake of curiosity - if I use multiplication to convert a number to base 10 without the adjustment, then the result is multiplied by (10^digits/2^shift). Is there a tricky way to remove this factor working with the decimal digits? Even with the adjustment factor, this seems like it would be faster -- why wouldn't standard libraries use this vs divide and mod?
Seeing that you are talking about numbers like 2^128/(N!), it seems that in your problem N is going to be rather small (N < 35 according to my calculations). I suggest taking the original algorithm as a starting point; first switch the direction of the loop:
i = 2;
while (i < N) {
swap A[N - 1 - i] and A[N - i + k % i]
k = k / i
i = i + 1
}
Now change the loop to do several permutations per iteration. I guess the speed of division is the same regardless of the number i, as long as i < 2^32.
Split the range 2...N-1 into sub-ranges so that the product of the numbers in each sub-range is less than 2^32:
2, 3, 4, ..., 12: product is 479001600
13, 14, ..., 19: product is 253955520
20, 21, ..., 26: product is 3315312000
27, 28, ..., 32: product is 652458240
33, 34, 35: product is 39270
Then, divide the long number k by the products instead of dividing by i. Each iteration will yield a remainder (less than 2^32) and a smaller number k. When you have the remainder, you can work with it in an inner loop using the original algorithm; which will now be faster because it doesn't involve long division.
Here is some code:
static const int rangeCount = 5;
static const int rangeLimit[rangeCount] = {13, 20, 27, 33, 36};
static uint32_t rangeProduct[rangeCount] = {
479001600,
253955520,
3315312000,
652458240,
39270
};
for (int rangeIndex = 0; rangeIndex < rangeCount; ++rangeIndex)
{
// The following two lines involve long division;
// math libraries probably calculate both quotient and remainder
// in one function call
uint32_t rangeRemainder = k % rangeProduct[rangeIndex];
k /= rangeProduct[rangeIndex];
// A range starts where the previous range ended
int rangeStart = (rangeIndex == 0) ? 2 : rangeLimit[rangeIndex - 1];
// Iterate over range
for (int i = rangeStart; i < rangeLimit[rangeIndex] && i < n; ++i)
{
// The following two lines involve a 32-bit division;
// it produces both quotient and remainder in one Pentium instruction
int remainder = rangeRemainder % i;
rangeRemainder /= i;
std::swap(permutation[n - 1 - i], permutation[n - i + remainder]);
}
}
Of course, this code can be extended into more than 128 bits.
Another optimization could involve extraction of powers of 2 from the products of ranges; this might add a slight speedup by making the ranges longer. Not sure whether this is worthwhile (maybe for large values of N, like N=1000).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With