Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast Euclidean division in C

I am interested in getting the remainder of the Euclidean division, that is, for a pair of integers (i, n), find r such as:

i = k * n + r, 0 <= r < |k|

the simple solution is:

int euc(int i, int n)
{
    int r;

    r = i % n;
    if ( r < 0) {
        r += n;
    }
    return r;
}

But since I need to execute this tens of million of times (it is used inside an iterator for multidimensional arrays), I would like to avoid the branching if possible. Requirements:

  • Branching but faster is also desirable.
  • A solution which works only for positive n is acceptable (but it has to work for negative i).
  • n is not known in advance, and can be any value > 0 and < MAX_INT

Edit

It is actually quite easy to get the result wrong, so here is an example of the expected results:

  • euc(0, 3) = 0
  • euc(1, 3) = 1
  • euc(2, 3) = 2
  • euc(3, 3) = 0
  • euc(-1, 3) = 2
  • euc(-2, 3) = 1
  • euc(-3, 3) = 0

Some people also worry that it does not make sense to optimize this. I need this for an multi-dimensional iterator where out of bounds items are replaced by items in a 'virtual array' which repeats the original array. So if my array x is [1, 2, 3, 4], the virtual array is [...., 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4], and for example, x[-2] is x1, etc...

For a nd array of dimension d, I need d Euclidean division for every point. If I need to do a correlation between a n^d array with a m^d kernel, I need n^d * m^d * d euclidean divisions. For a 3d image of 100x100x100 points and a kernel of 5*5*5 points, that's already ~ 400 million Euclidean divisions.

like image 983
David Cournapeau Avatar asked Jul 16 '09 04:07

David Cournapeau


2 Answers

Edit: No multiplication or branches woot.

int euc(int i, int n)
{
    int r;

    r = i % n;
    r += n & (-(r < 0));

    return r;
}

Here is the generated code. According to the MSVC++ instrumenting profiler (my testing) and the OP's testing, they perform nearly the same.

; Original post
00401000  cdq              
00401001  idiv        eax,ecx 
00401003  mov         eax,edx 
00401005  test        eax,eax 
00401007  jge         euc+0Bh (40100Bh) 
00401009  add         eax,ecx 
0040100B  ret              

; Mine
00401020  cdq              
00401021  idiv        eax,ecx 
00401023  xor         eax,eax 
00401025  test        edx,edx 
00401027  setl        al   
0040102A  neg         eax  
0040102C  and         eax,ecx 
0040102E  add         eax,edx 
00401030  ret              
like image 189
Sam Harwell Avatar answered Oct 02 '22 01:10

Sam Harwell


I think 280Z28 and Christopher have the assembler golf covered better than I would, and that deals with random access.

What you're actually doing, though, seems to be processing entire arrays. Obviously for reasons of memory caching you already want to be doing this in order if possible, since avoiding a cache miss is a many, many times better optimisation than avoiding a small branch.

In that case, with a suitable bounds check first you can do the inner loop in what I will call "dashes". Check that the next k increments don't result in an overflow in the smallest dimension on either array, and then "dash" k steps using a new even-more-inner loop which just increments the "physical" index by 1 each time instead of doing another idiv. You or the compiler can unroll this loop, use Duff's device, etc.

If the kernel is small, and especially if it is fixed size, then that (or a multiple of it with suitable unrolling to occasionally subtract instead of adding), is probably the value to use for the length of the "dash". Compile-time constant dash length is probably best, since then you (or the compiler) can fully unroll the dash loop and elide the continuation condition. As long as this doesn't make the code too big to be fast, it essentially replaces the entire positive-modulo operation with an integer increment.

If the kernel is not fixed size, but is often very small in its last dimension, consider having different versions of the comparison function for the most common sizes, with the dash loop fully unrolled in each.

Another possibility is to calculate the next point at which an overflow will occur (in either array), and then dash to that value. You still have a continuation condition in the dash loop, but it goes as long as possible using only increments.

Alternatively, if operation you're doing is numeric equality or some other simple operation (I don't know what a "correlation" is) you could look at SIMD instructions or whatever, in which case the dash length should be a multiple of the widest single-instruction compare (or appropriate SIMD op) on your architecture. This isn't something I have experience with, though.

like image 34
Steve Jessop Avatar answered Oct 01 '22 23:10

Steve Jessop