Given the code:
for (int i = 0; i < n; ++i)
{
A(i);
B(i);
C(i);
}
And the unrolled version:
for (int i = 0; i < (n - 2); i += 3)
{
A(i);
A(i+1);
A(i+2);
B(i);
B(i+1);
B(i+2);
C(i);
C(i+1);
C(i+2);
}
Something is not clear to me: which is better? I can't see how either version would run any faster than the other. Am I missing something here?
All I see is that each instruction depends on the previous one, meaning that I have to wait for the previous instruction to finish before starting the next...
Thanks
In the high-level view of a language, you're not going to see the optimization. The speed enhancement comes from what the compiler does with what you have.
In the first case, it's something like:
LOCATION_FLAG;
DO_SOMETHING;
TEST FOR LOOP COMPLETION;//Jumps to LOCATION_FLAG if false
In the second it's something like:
LOCATION_FLAG;
DO_SOMETHING;
DO_SOMETHING;
DO_SOMETHING;
TEST FOR LOOP COMPLETION;//Jumps to LOCATION_FLAG if false
You can see in the latter case, the overhead of testing and jumping is only 1 instruction per 3. In the first it's 1 instruction per 1; so it happens a lot more often.
Therefore, if you have invariants you can rely on (for example, an array whose length is a multiple of 3, to use your example) then it is more efficient to unroll loops, because the underlying assembly maps more directly to the work being done.
Loop unrolling reduces the number of jump and branch instructions, which can make the loop faster, but it also increases the size of the binary. Depending on the implementation and the platform, either version could be faster.
Well, whether this code is "better" or "worse" totally depends on the implementations of A, B and C, on which values of n you expect, on which compiler you are using, and on which hardware you are running.
Typically the benefit of loop unrolling is that the overhead of doing the loop (that is, incrementing i and comparing it with n) is reduced. In this case it could be reduced by a factor of about 3.
As long as the functions A(), B() and C() don't modify the same data, the second version offers more opportunities for parallelism.
In the first version, the three calls for a single i could run simultaneously, assuming no interdependencies. In the second version, all three functions could be run on all three indices at the same time, assuming you had enough execution units and, again, no interdependencies.