When is loop unwinding effective?

Question

Loop unwinding is a common way to help the compiler to optimize performance. I was wondering if and to what extent the performance gain is affected by what is in the body of the loop:

number of statements
number of function calls
use of complex data types, virtual methods, etc.
dynamic (de)allocation of memory

What rules (of thumb?) do you use to decide whether or not to unwind a performance critical loop? What other optimisation do you consider in these cases?

Nils Pipenbrinck · Accepted Answer

In general unrolling loops by hand is not worth the effort. The compiler knows better how the target architecture works and will unroll the loop if it is beneficial.

There are code-paths that benefit when unrolled for Pentium-M type CPU's but don't benefit for Core2 for example. If I unroll by hand the compiler can't make the decision anymore and I may end up with less than optimal code. E.g. exactly the opposite I tried to achieve.

There are several cases where I do unroll performance critical loops by hand, but I only do this if I know that the compiler will - after manual unrolling - be able to use architectural specific feature such as SSE or MMX instructions. Then, and only then I do it.

Btw - modern CPUs are very efficient at executing well predictable branches. This is exactly what a loop is. The loop overhead is so small these days that it rarely makes a difference. Memory latency effects that may occur due to the increase in code-size will however make a difference.

Jon Topper · Answer

This is an optimisation question, and as such there is only one rule of thumb: test the performance, and try a loop unwinding optimisation only if your testing demonstrates that you need to. Consider less disruptive optimisations first.

jilles de wit · Answer

In my experience, loop unwinding, and the work it takes is effective when:

There are only a few statements within the loop.
the statements involve only small number of different variables and no function calls
Your operations work on already allocated memory (an in-place image transformation for example)

Partial unwinding is often less work for 80% of the gain. So instead of looping over all pixels of an N by M image (NM iterations) where N is always divisible by 8, loop (NM/8) times over each block of eight pixels. This is especially efficient if you are perfoming some operation that uses some of the neighbouring pixels.

I've had very good results hand-optimising pixel-wise operations into MMX or SSE instructions (8 or 16 pixels at a time) but I've also spent days hand optimising something only to find out that the version optimised by the compiler ran ten times faster.

And by the way, for the most (beautiful|remarkable) example of loop unwinding check out Duffs device

Galghamon · Answer

An important thing to consider: In production code at your workplace, future readability of your code far outweighs the benefits of loop unwinding. Hardware is cheap, programmer time is not. I would only worry about loop unwinding if that is the ONLY way to solve a proven performance problem (say in a low-powered device).

Other thoughts: The characteristics of compilers vary greatly, and in some cases, like Java, the determination is made on the fly by the HotspotJVM, so there I would argue against loop unwinding in any case.

When is loop unwinding effective?

Tags:

c++

performance

algorithm

optimization

andreas buykx

4 Answers

Nils Pipenbrinck

Jon Topper

jilles de wit

Galghamon

Recent Activity

Donate For Us

When is loop unwinding effective?

Tags:

c++

performance

algorithm

optimization

andreas buykx

4 Answers

Nils Pipenbrinck

Jon Topper

jilles de wit

Galghamon

Related questions

Recent Activity

Donate For Us