Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do optimizing compilers decide when and how much to unroll a loop?

When a compiler performs a loop-unroll optimization, how does it determined by which factor to unroll the loop or whether to unroll the whole loop? Since this is a space-performance trade-off, on average how effictive is this optimization technique in making the program perform better? Also, under what conditions is it recommended to use this technique (i.e certain operations or calculations)?

This doesn't have to be specific to a certain compiler. It can be any explanation outlining the idea behind this technique and what has been observed in practice.

like image 562
Mike G Avatar asked Oct 07 '11 18:10

Mike G


2 Answers

When a compiler performs a loop unroll optimization, how does it determined by which factor to unroll the loop or weather to unroll the whole loop or not.

stack consumption and locality. instruction counts. ability to make/propagate optimizations based on the unrolled and inlined program. whether the loop size is fixed, or expected to be in a certain range. profile inputs (if applicable). operations which may be removed from the loop body. etc.

Since this is a space-performance tradeoff on average how effictive is this optimization technique in making the program perform better?

it depends largely on the input (your program). it can be slower (not typical) or it can be several times faster. writing a program to run optimally and which also enables the optimizer to do its job is learned.

Also, under what conditions is it recommended to use this technique (i.e certain operations or calculations)

generally, a large number of iterations on very small bodies, particularly that which is branchless and has good data locality.

if you want to know if the option helps your app, profile.

if you need more than that, you should reserve some time to learn how to write optimal programs, since the subject is quite complex.

like image 59
justin Avatar answered Sep 20 '22 14:09

justin


The simplistic analysis is to count instructions - a 2 instruction loop unrolled 10 times has 11 instructions instead of 20 yields a 11/20 speedup. But with modern processor architectures it's much more complex; depending on cache sizes and the characteristics of the processors instruction pipeline. It's possible that the above example would run 10x faster instead of 2x. It's also possible that unrolling 1000x instead of 10x would run slower. Without targeting a specific processor, compilers (or pragmas you write for them) are just guessing.

like image 30
ddyer Avatar answered Sep 19 '22 14:09

ddyer