Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimize this function (in C++)

I have a cpu-consuming code, where some function with a loop is executed many times. Every optimization in this loop brings noticeable performance gain. Question: How would you optimize this loop (there is not much more to optimize though...)?

void theloop(int64_t in[], int64_t out[], size_t N)
{
    for(uint32_t i = 0; i < N; i++) {
        int64_t v = in[i];
        max += v;
        if (v > max) max = v;
        out[i] = max;
    }
}

I tried a few things, e.g. I replaced arrays with pointers that were incremented in every loop, but (surprisingly) i lost some performance instead of gaining...

Edit:

  • changed name of one variable (itsMaximums, error)
  • the function is an a method of a class
  • in and put are int64_t , so are negative and positive
  • `(v > max) can evaluate to true: consider the situation when actual max is negative
  • the code runs on 32-bit pc (development) and 64-bit (production)
  • N is unknown at compile time
  • I tried some SIMD, but I failed to increase performance... (the overhead of moving the variables to _m128i, executing and storing back was higher than than SSE speed gain. Yet I am not an expert on SSE, so maybe I had a poor code)

Results:

I added some loop unfolding, and a nice hack from Alex'es post. Below I paste some results:

  1. original: 14.0s
  2. unfolded loop (4 iterations): 10.44s
  3. Alex'es trick: 10.89s
  4. 2) and 3) at once: 11.71s

strage, that 4) is not faster than 3) and 4). Below code for 4):

for(size_t i = 1; i < N; i+=CHUNK) {
    int64_t t_in0 = in[i+0];
    int64_t t_in1 = in[i+1];
    int64_t t_in2 = in[i+2];
    int64_t t_in3 = in[i+3];


    max &= -max >> 63;
    max += t_in0;
    out[i+0] = max;

    max &= -max >> 63;
    max += t_in1;
    out[i+1] = max;

    max &= -max >> 63;
    max += t_in2;
    out[i+2] = max;

    max &= -max >> 63;
    max += t_in3;
    out[i+3] = max;

}
like image 928
Jakub M. Avatar asked Nov 15 '11 08:11

Jakub M.


1 Answers

First, you need to look at the generated assembly. Otherwise you have no way of knowing what actually happens when this loop is executed.

Now: is this code running on a 64-bit machine? If not, those 64-bit additions might hurt a bit.

This loop seems an obvious candidate for using SIMD instructions. SSE2 supports a number of SIMD instructions for integer arithmetics, including some that work on two 64-bit values.

Other than that, see if the compiler properly unrolls the loop, and if not, do so yourself. Unroll a couple of iterations of the loop, and then reorder the hell out of it. Put all the memory loads at the top of the loop, so they can be started as early as possible.

For the if line, check that the compiler is generating a conditional move, rather than a branch.

Finally, see if your compiler supports something like the restrict/__restrict keyword. It's not standard in C++, but it is very useful for indicating to the compiler that in and out do not point to the same addresses.

Is the size (N) known at compile-time? If so, make it a template parameter (and then try passing in and out as references to properly-sized arrays, as this may also help the compiler with aliasing analysis)

Just some thoughts off the top of my head. But again, study the disassembly. You need to know what the compiler does for you, and especially, what it doesn't do for you.

Edit

with your edit:

max &= -max >> 63;
max += t_in0;
out[i+0] = max;

what strikes me is that you added a huge dependency chain. Before the result can be computed, max must be negated, the result must be shifted, the result of that must be and'ed together with its original value, and the result of that must be added to another variable.

In other words, all these operations have to be serialized. You can't start one of them before the previous has finished. That's not necessarily a speedup. Modern pipelined out-of-order CPUs like to execute lots of things in parallel. Tying it up with a single long chain of dependant instructions is one of the most crippling things you can do. (Of course, it if can be interleaved with other iterations, it might work out better. But my gut feeling is that a simple conditional move instruction would be preferable)

like image 82
jalf Avatar answered Oct 16 '22 06:10

jalf