Furthermore, how does the compiler determine how far to unroll a loop, assuming all operations in the loop are completely independent of other iterations?
Loop unrolling in C code
When a loop is unrolled, the loop counter needs updating less often and fewer branches are executed. If the loop iterates only a few times, it can be fully unrolled so that the loop overhead disappears completely. The compiler unrolls loops automatically at -O3.
Loop unrolling is a technique used to increase the number of instructions executed between executions of the loop branch logic. This reduces the number of times the loop branch logic is executed.
Loop unrolling is a technique that epitomizes the space vs time tradeoff. It's an optimization that expands loops to make them execute more efficiently. Many loops operate inefficiently because the loop counter has to be updated after each execution.
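To make the counter-update and branch savings described above concrete, here is a minimal hand-written sketch. The function names `sum_rolled` and `sum_unrolled4` and the unroll factor of 4 are illustrative assumptions, not something a particular compiler is known to emit.

```c
#include <stddef.h>

/* Rolled version: one add, one counter update and one branch per element. */
long sum_rolled(const int *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Unrolled by 4 (factor chosen for illustration): one counter update and
   one branch per four elements. */
long sum_unrolled4(const int *a, size_t n)
{
    long total = 0;
    size_t i = 0;

    /* Main loop: handles complete groups of four elements. */
    for (; i + 4 <= n; i += 4) {
        total += a[i];
        total += a[i + 1];
        total += a[i + 2];
        total += a[i + 3];
    }

    /* Remainder loop: handles the last n % 4 elements. */
    for (; i < n; i++)
        total += a[i];

    return total;
}
```

Both functions return the same result for any input; the unrolled one trades a larger body for fewer loop-control instructions, which is the space vs time tradeoff mentioned above.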
Unrolled loops are not always faster. They generate larger binaries. They require more instruction decoding. They use more memory and instruction cache.
For MSVC there is only a vector independence hint: http://msdn.microsoft.com/en-us/library/hh923901.aspx
#pragma loop( ivdep )
For many other compilers, like Intel/IBM, there are several pragma hints for optimizing a loop:
#pragma unroll
#pragma loop count N
#pragma ivdep
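As a rough sketch of where such hints go: they are placed directly before the loop they apply to. The macros `HINT_UNROLL4` and `HINT_IVDEP` and the function `scale` below are hypothetical names made up for this example; the macros expand to the Intel pragmas only when an Intel compiler is detected, so other compilers see a plain loop. The unroll factor 4 is an arbitrary assumption, and the loop count hint is left out because its exact spelling varies.

```c
#include <stddef.h>

/* Hypothetical helper macros: expand to the hints listed above when an
   Intel compiler is detected, and to nothing elsewhere. */
#if defined(__INTEL_COMPILER) || defined(__INTEL_LLVM_COMPILER)
#  define HINT_UNROLL4 _Pragma("unroll(4)")
#  define HINT_IVDEP   _Pragma("ivdep")
#else
#  define HINT_UNROLL4
#  define HINT_IVDEP
#endif

/* Each iteration only touches element i, so iterations are independent
   as long as dst and src do not overlap (assumed by the caller). */
void scale(float *dst, const float *src, float k, size_t n)
{
    HINT_IVDEP
    HINT_UNROLL4
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];
}
```

These are only hints: the compiler is still free to pick a different unroll factor or to skip unrolling altogether.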
There is a thread in which MSVC++ people discuss the unroll heuristic: http://social.msdn.microsoft.com/Forums/en-US/vcgeneral/thread/d0b225c2-f5b0-4bb9-ac6a-4d4f61f7cb17/
VC tries to balance execution speed and code size. You can change the balance by using flags /O1 or /O2, but even when optimizing for speed VC tries to conserve code size as well.
Basically, unrolling increases code size, so it may be limited in the Os and O1 modes (modes table).
PS: A pragma looks like a preprocessor directive, but it is not. It is a directive for the compiler, and it is ignored (kept) by the preprocessor.
In the case of Intel Compiler:
#pragma loop count N tells the compiler how many iterations to expect, which helps it choose the best strategy for vectorizing the loop. This saves time, so we can say it also helps drive loop unrolling. Examples:
#pragma loop_count min(n),max(n),avg(n)
#pragma unroll (n) works only when used with the -O3 flag. You can use the following strategy to unroll your loop according to the target processor.
Despite the larger code that loop unrolling generates, it may be worth it, since the compiler will produce a version of the loop for scalar operations as well as one for vector operations.
There are cases where unrolling hurts performance. For instance, a loop with 20 iterations and a vector length of 16 results in one loop that executes 16 operations at once and a remainder loop that executes the other 4 sequentially. To avoid the remainder loop generated by the compiler, we can place one of the following before the loop:
#pragma vector novecremainder //or -mP2OPT_hpo_vec_peel = F to disable peel and remainder loops (compiler internal option)
or
#pragma nounroll //where unrolling is not worthwhile at all
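The 20-iteration example above can be worked through in plain C. This is only a scalar emulation of the main loop plus remainder loop split, not actual vector code; `VLEN`, `main_iterations`, `remainder_iterations` and `add_one` are names assumed for illustration.

```c
#include <stddef.h>

#define VLEN 16  /* assumed vector length, taken from the example above */

/* Number of full vector-width iterations for a trip count of n. */
size_t main_iterations(size_t n)
{
    return n / VLEN;
}

/* Number of elements left over for the sequential remainder loop. */
size_t remainder_iterations(size_t n)
{
    return n % VLEN;
}

/* Scalar emulation of the split: a blocked main loop plus a remainder loop. */
void add_one(int *a, size_t n)
{
    size_t i = 0;

    /* Main loop: each outer iteration stands in for one vector operation
       covering VLEN elements. */
    for (; i + VLEN <= n; i += VLEN)
        for (size_t lane = 0; lane < VLEN; lane++)
            a[i + lane] += 1;

    /* Remainder loop: the leftover n % VLEN elements, one at a time. */
    for (; i < n; i++)
        a[i] += 1;
}
```

For n = 20, `main_iterations` gives 1 and `remainder_iterations` gives 4, matching the one 16-wide step and four sequential steps described above. With such a short trip count the remainder loop accounts for most of the executed loop iterations, which is why suppressing it or disabling unrolling can pay off.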
Just to clarify the #pragma ivdep :
Hope this helps.