Given the code:
for (int i = 0; i < n; ++i)
{
A(i);
B(i);
C(i);
}
And the unrolled version:
for (int i = 0; i < (n - 2); i += 3)
{
A(i);
A(i+1);
A(i+2);
B(i);
B(i+1);
B(i+2);
C(i);
C(i+1);
C(i+2);
}
Something is not clear to me: which is better? I can't see how either version would run any faster than the other. Am I missing something here?
All I see is that each instruction depends on the previous one, meaning that I have to wait for the previous instruction to finish before starting the next...
Thanks
In the high-level view of a language, you're not going to see the optimization. The speed enhancement comes from what the compiler does with what you have.
In the first case, it's something like:
LOCATION_FLAG;
DO_SOMETHING;
TEST FOR LOOP COMPLETION;//Jumps to LOCATION_FLAG if false
In the second it's something like:
LOCATION_FLAG;
DO_SOMETHING;
DO_SOMETHING;
DO_SOMETHING;
TEST FOR LOOP COMPLETION;//Jumps to LOCATION_FLAG if false
You can see in the latter case, the overhead of testing and jumping is only 1 instruction per 3. In the first it's 1 instruction per 1; so it happens a lot more often.
Therefore, if you have invariants you can rely on (for example, an array whose length is a multiple of 3, to use your example) then it is more efficient to unroll loops, because the underlying assembly maps more directly to the work being done.
Loop unrolling reduces the number of jump and branch instructions, which can make the loop faster, but it also increases the size of the binary. Depending on the implementation and the platform, either version could be faster.
Well, whether this code is "better" or "worse" totally depends on the implementations of A, B and C, on which values of n you expect, on which compiler you are using, and on which hardware you are running.
Typically the benefit of loop unrolling is that the overhead of doing the loop (that is, incrementing i and comparing it with n) is reduced. In this case it could be reduced by a factor of about 3.
As long as the functions A(), B() and C() don't modify the same data, the second version offers more opportunities for parallelism.
In the first version, the three calls for a single i could run simultaneously, assuming no interdependencies. In the second version, all three functions could be run on all three indices at the same time, assuming you had enough execution units and, again, no interdependencies.