I have CPU-consuming code in which a function containing a loop is executed many times. Every optimization of this loop brings a noticeable performance gain. Question: how would you optimize this loop (there is not much more to optimize, though...)?
#include <cstdint>   // int64_t
#include <cstddef>   // size_t

void theloop(int64_t in[], int64_t out[], size_t N)
{
    int64_t max = 0;   // declaration was missing from the paste; it is itsMaximums in my real code (see edit)
    for (uint32_t i = 0; i < N; i++) {
        int64_t v = in[i];
        max += v;
        if (v > max) max = v;
        out[i] = max;
    }
}
I tried a few things, e.g. I replaced the arrays with pointers that were incremented in every iteration, but (surprisingly) I lost some performance instead of gaining...
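Roughly, the pointer version was along these lines (a sketch, not the exact code I benchmarked; theloop_ptr is just a name for this listing):

void theloop_ptr(int64_t in[], int64_t out[], size_t N)
{
    int64_t max = 0;
    const int64_t* src = in;
    int64_t*       dst = out;
    for (size_t i = 0; i < N; i++) {
        int64_t v = *src++;    // read and advance the input pointer
        max += v;
        if (v > max) max = v;
        *dst++ = max;          // write and advance the output pointer
    }
}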
Edit:
- max lives outside the loop: in my real code it is a member variable called itsMaximums (the paste above had an error and omitted it)
- the values are int64_t, so they can be negative as well as positive
- N is unknown at compile time
- I tried SSE (_m128i), but the cost of loading values into _m128i, executing, and storing them back was higher than the SSE speed gain. Yet I am not an expert on SSE, so maybe my code was poor.

Results:
I added some loop unrolling and a nice hack from Alex's post, and benchmarked a few variants. Strangely, 4) is not faster than 3). Below is the code for 4):
for (size_t i = 1; i < N; i += CHUNK) {   // CHUNK == 4 for this body
    int64_t t_in0 = in[i+0];
    int64_t t_in1 = in[i+1];
    int64_t t_in2 = in[i+2];
    int64_t t_in3 = in[i+3];

    max &= -max >> 63;   // branchless clamp: reset max to 0 if it went negative
    max += t_in0;
    out[i+0] = max;

    max &= -max >> 63;
    max += t_in1;
    out[i+1] = max;

    max &= -max >> 63;
    max += t_in2;
    out[i+2] = max;

    max &= -max >> 63;
    max += t_in3;
    out[i+3] = max;
}
First, you need to look at the generated assembly. Otherwise you have no way of knowing what actually happens when this loop is executed.
Now: is this code running on a 64-bit machine? If not, those 64-bit additions might hurt a bit.
This loop seems an obvious candidate for using SIMD instructions. SSE2 supports a number of SIMD instructions for integer arithmetic, including some that operate on two 64-bit values at a time.
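To make that concrete, here is a minimal sketch of one such SSE2 instruction, paddq (_mm_add_epi64), which adds two 64-bit lanes at once. It is only an illustration of the instruction set, not a vectorization of the loop above, since each out[i] depends on the previous value of max (add_pairs is just an illustrative name):

#include <emmintrin.h>   // SSE2 intrinsics
#include <cstdint>

// Adds two pairs of 64-bit integers with a single paddq instruction.
void add_pairs(const int64_t a[2], const int64_t b[2], int64_t r[2])
{
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    __m128i vr = _mm_add_epi64(va, vb);   // two 64-bit additions in one go
    _mm_storeu_si128(reinterpret_cast<__m128i*>(r), vr);
}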
Other than that, see if the compiler properly unrolls the loop, and if not, do so yourself. Unroll a couple of iterations of the loop, and then reorder the hell out of it. Put all the memory loads at the top of the loop, so they can be started as early as possible.
For the if statement, check that the compiler generates a conditional move rather than a branch.
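One way to nudge the compiler in that direction is to write the update as a ternary, which optimizers commonly lower to a cmov (a sketch; theloop_cmov is an illustrative name, and whether a cmov is actually emitted depends on the compiler and target):

#include <cstdint>
#include <cstddef>

void theloop_cmov(const int64_t in[], int64_t out[], size_t N)
{
    int64_t max = 0;
    for (size_t i = 0; i < N; i++) {
        int64_t v = in[i];
        max += v;
        max = (v > max) ? v : max;   // same as: if (v > max) max = v;
        out[i] = max;
    }
}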
Finally, see if your compiler supports something like the restrict/__restrict keyword. It's not standard in C++, but it is very useful for indicating to the compiler that in and out do not point to the same addresses.
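For example (a sketch; theloop_restrict is an illustrative name; GCC and Clang also accept __restrict__, MSVC spells it __restrict):

#include <cstdint>
#include <cstddef>

// __restrict promises the compiler that in and out never alias,
// which can allow loads and stores to be scheduled more freely.
void theloop_restrict(const int64_t* __restrict in,
                      int64_t* __restrict out,
                      size_t N)
{
    int64_t max = 0;
    for (size_t i = 0; i < N; i++) {
        int64_t v = in[i];
        max += v;
        if (v > max) max = v;
        out[i] = max;
    }
}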
Is the size (N) known at compile time? If so, make it a template parameter (and then try passing in and out as references to properly-sized arrays, as this may also help the compiler with aliasing analysis).
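Something along these lines (a sketch; theloop_fixed is an illustrative name):

#include <cstdint>
#include <cstddef>

// N is a compile-time constant, and in/out are references to arrays of
// exactly N elements, which may help the compiler's aliasing analysis.
template <size_t N>
void theloop_fixed(const int64_t (&in)[N], int64_t (&out)[N])
{
    int64_t max = 0;
    for (size_t i = 0; i < N; i++) {
        int64_t v = in[i];
        max += v;
        if (v > max) max = v;
        out[i] = max;
    }
}

// usage:
//   int64_t a[1024], b[1024];
//   theloop_fixed(a, b);   // N = 1024 deduced from the array type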
Just some thoughts off the top of my head. But again, study the disassembly. You need to know what the compiler does for you, and especially, what it doesn't do for you.
Edit:
With your edit:
max &= -max >> 63;
max += t_in0;
out[i+0] = max;
What strikes me is that you added a huge dependency chain. Before the result can be computed, max must be negated, the result must be shifted, the result of that must be ANDed with the original value, and the result of that must be added to another variable.

In other words, all these operations have to be serialized: you can't start one before the previous has finished. That's not necessarily a speedup. Modern pipelined, out-of-order CPUs like to execute lots of things in parallel, and tying them up with a single long chain of dependent instructions is one of the most crippling things you can do. (Of course, if it can be interleaved with other iterations, it might work out better. But my gut feeling is that a simple conditional move instruction would be preferable.)
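For comparison, a sketch of the same recurrence with the clamp written as a plain comparison (theloop_clamp is an illustrative name). The ternary computes exactly what max &= -max >> 63 computes, namely resetting a negative max to zero, but it maps to a cmp plus cmov rather than a neg + shift + and chain:

#include <cstdint>
#include <cstddef>

void theloop_clamp(const int64_t in[], int64_t out[], size_t N)
{
    int64_t max = 0;
    for (size_t i = 0; i < N; i++) {
        max = (max < 0) ? 0 : max;   // same effect as: max &= -max >> 63;
        max += in[i];
        out[i] = max;
    }
}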