I am interested in getting the remainder of the Euclidean division, that is, for a pair of integers (i, n), finding r such that:
i = k * n + r, 0 <= r < |n|
The simple solution is:
int euc(int i, int n)
{
    int r;
    r = i % n;
    if (r < 0) {
        r += n;
    }
    return r;
}
But since I need to execute this tens of millions of times (it is used inside an iterator for multidimensional arrays), I would like to avoid the branching if possible.
It is actually quite easy to get the result wrong; for example, euc(-1, 3) must be 2, whereas i % n in C yields -1 for that input.
Some people also worry that it does not make sense to optimize this. I need this for a multi-dimensional iterator where out-of-bounds items are replaced by items in a 'virtual array' which repeats the original array. So if my array x is [1, 2, 3, 4], the virtual array is [..., 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, ...], and, for example, x[-2] is x[2], etc.
For an n-d array of dimension d, I need d Euclidean divisions for every point. If I need to do a correlation between an n^d array and an m^d kernel, I need n^d * m^d * d Euclidean divisions. For a 3-d image of 100x100x100 points and a kernel of 5x5x5 points, that's already ~375 million Euclidean divisions.
Edit: No multiplications or branches, woot.
int euc(int i, int n)
{
    int r;
    r = i % n;
    /* (r < 0) evaluates to 0 or 1; negating it yields 0 or an
       all-ones mask, so n & mask adds either 0 or n, branch-free. */
    r += n & (-(r < 0));
    return r;
}
Here is the generated code. According to the MSVC++ instrumenting profiler (my testing) and the OP's testing, the two versions perform nearly the same.
; Original post
00401000 cdq
00401001 idiv eax,ecx
00401003 mov eax,edx
00401005 test eax,eax
00401007 jge euc+0Bh (40100Bh)
00401009 add eax,ecx
0040100B ret
; Mine
00401020 cdq
00401021 idiv eax,ecx
00401023 xor eax,eax
00401025 test edx,edx
00401027 setl al
0040102A neg eax
0040102C and eax,ecx
0040102E add eax,edx
00401030 ret
I think 280Z28 and Christopher have the assembler golf covered better than I would, and that deals with random access.
What you're actually doing, though, seems to be processing entire arrays. For reasons of memory caching you already want to be doing this in order if possible, since avoiding a cache miss is a far more valuable optimisation than avoiding a small branch.
In that case, with a suitable bounds check first you can do the inner loop in what I will call "dashes". Check that the next k increments don't result in an overflow in the smallest dimension on either array, and then "dash" k steps using a new even-more-inner loop which just increments the "physical" index by 1 each time instead of doing another idiv. You or the compiler can unroll this loop, use Duff's device, etc.
If the kernel is small, and especially if it is fixed size, then that (or a multiple of it with suitable unrolling to occasionally subtract instead of adding), is probably the value to use for the length of the "dash". Compile-time constant dash length is probably best, since then you (or the compiler) can fully unroll the dash loop and elide the continuation condition. As long as this doesn't make the code too big to be fast, it essentially replaces the entire positive-modulo operation with an integer increment.
If the kernel is not fixed size, but is often very small in its last dimension, consider having different versions of the comparison function for the most common sizes, with the dash loop fully unrolled in each.
Another possibility is to calculate the next point at which an overflow will occur (in either array), and then dash to that value. You still have a continuation condition in the dash loop, but it goes as long as possible using only increments.
Alternatively, if the operation you're doing is numeric equality or some other simple operation (I don't know what a "correlation" is) you could look at SIMD instructions, in which case the dash length should be a multiple of the widest single-instruction compare (or appropriate SIMD op) on your architecture. This isn't something I have experience with, though.