In C, I have a task where I must do multiplication, inversion, transposition, addition, etc. with huge matrices allocated as 2-dimensional arrays (arrays of arrays).
I have found the gcc flag -funroll-all-loops. If I understand correctly, this will unroll all loops automatically, without any effort by the programmer.
My questions:
a) Does gcc include this kind of optimization with the various optimization flags such as -O1, -O2, etc.?
b) Do I have to use any pragmas inside my code to take advantage of loop unrolling, or are loops identified automatically?
c) Why is this option not enabled by default if the unrolling increases the performance?
d) What are the recommended gcc optimization flags to compile the program in the best way possible? (I must run this program optimized for a single CPU family, which is the same as the machine where I compile the code; currently I use the -march=native and -O2 flags.)
EDIT
It seems there is some controversy about the use of unrolling, which in some cases may slow down performance. In my situation there are various methods that simply do math operations in 2 nested for loops, iterating over matrix elements, for a huge number of elements. In this scenario, how could unrolling slow down or increase the performance?
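For concreteness, here is a minimal sketch of the kind of routine I mean (matrix_add and the size N are illustrative, not my real code):

```c
#include <stddef.h>

#define N 256  /* illustrative size; the real matrices are much bigger */

/* Element-wise addition over 2-D arrays: two nested for loops
 * touching every matrix element. The inner loop body is what
 * -funroll-loops would replicate. */
void matrix_add(double a[N][N], const double b[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] += b[i][j];
}
```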
Since loop unrolling is a tradeoff between code size and speed, the effectiveness of loop unrolling is highly dependent on the loop unrolling factor, that is, the number of times the loop is unrolled. As the factor increases, code size increases, leading to potential issues with the front end, such as icache misses.
In fact, many compilers will happily (and silently) unroll loops for you (though not always). Unrolled loops are not always faster. They generate larger binaries. They require more instruction decoding.
Improved floating-point performance - loop unrolling can improve performance by providing the compiler more instructions to schedule across the unrolled iterations. This reduces the number of NOPs generated and also provides the compiler with a greater opportunity to generate parallel instructions.
Classic loop unrolling can increase the performance of sequential loops by reducing the overhead of the non-computational parts of the loop. Unfortunately, when the loop contains parallelism inside, most compilers will either ignore it or perform a naïve transformation.
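As a hand-written illustration of that scheduling point (a sketch of the idea, not necessarily the code GCC emits), unrolling a reduction by four exposes independent accumulators:

```c
#include <stddef.h>

/* Hand-unrolled dot product with four independent accumulators.
 * The four partial sums have no dependency on each other, so the
 * CPU can keep several floating-point additions in flight at once
 * instead of serializing on a single accumulator.
 * Assumes n is a multiple of 4, for brevity. */
double dot_unrolled4(const double *x, const double *y, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```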
Modern processors pipeline instructions. They like knowing what's coming next and make all sorts of fancy optimisations based on assumptions of which order the instructions should be executed.
At the end of a loop though, there are two possibilities! Either you go back to the top, or continue on. The processor makes an educated guess on which is going to happen. If it gets it right, everything is good. If not, it has to flush the pipeline and stall for a bit while it prepares for taking the other branch.
As you can imagine, unrolling a loop eliminates branches and the potential for those stalls, especially in cases where the odds are against a guess.
Imagine a loop of code that executes 3 times, then continues. If you assume (as the processor probably would) that at the end you'll repeat the loop, 2/3 of the time you'll be correct! 1/3 of the time, though, you'll stall.
On the other hand, imagine the same situation, but the code loops 3000 times. Here, there's probably only a gain 1/3000 of the time from unrolling.
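To make the branch-elimination point concrete (sum3 is a made-up example), fully unrolling a 3-iteration loop leaves straight-line code with no backward branch left to mispredict:

```c
/* Rolled: the loop-back branch must be predicted on every iteration,
 * and the final "fall through" is the case that gets mispredicted. */
int sum3_rolled(const int v[3])
{
    int s = 0;
    for (int i = 0; i < 3; i++)
        s += v[i];
    return s;
}

/* Fully unrolled: straight-line code, no branches at all. */
int sum3_unrolled(const int v[3])
{
    return v[0] + v[1] + v[2];
}
```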
Part of the processor fanciness mentioned above involves loading the instructions from the executable in memory into the processor's onboard instruction cache (shortened to I-cache). This holds a limited amount of instructions which can be accessed quickly, but may stall when new instructions need to be loaded from memory.
Let's go back to the previous examples. Assume a reasonably small amount of code inside the loop takes up n bytes of I-cache. If we unroll the loop, it's now taking up n * 3 bytes. A bit more, but it'll probably fit in a single cache line just fine, so your cache will be working optimally and won't need to stall reading from main memory.
The 3000-iteration loop, however, unrolls to use a whopping n * 3000 bytes of I-cache. That's going to require several reads from memory, and will probably push some other useful stuff from elsewhere in the program out of the I-cache.
As you can see, unrolling provides more benefits for shorter loops but ends up trashing performance if you're intending to loop a large number of times.
Usually, a smart compiler will take a decent guess about which loops to unroll but you can force it if you're sure you know better. How do you get to know better? The only way is to try it both ways and compare timings!
Premature optimization is the root of all evil -- Donald Knuth
Profile first, optimise later.
Loop unrolling does not work if the compiler can't predict the exact number of iterations of the loop at compile time (or at least an upper bound, so it can skip as many iterations as needed). This means that if your matrix size is variable, the flag will have no effect.
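For example (an illustrative sketch), a compile-time constant bound gives the unroller a known trip count, while a runtime bound does not:

```c
#include <stddef.h>

#define FIXED_N 512  /* illustrative compile-time matrix dimension */

/* Trip count known at compile time: a good candidate for
 * full or partial unrolling. */
double sum_fixed(const double v[FIXED_N])
{
    double s = 0.0;
    for (size_t i = 0; i < FIXED_N; i++)
        s += v[i];
    return s;
}

/* Trip count only known at run time: the compiler can still
 * partially unroll with a remainder loop, but it cannot plan
 * an exact unroll factor. */
double sum_variable(const double *v, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}
```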
Now to answer your questions:
a) Does gcc include this kind of optimization with the various optimization flags such as -O1, -O2 etc.?
Nope, you have to explicitly set it since it may or may not make the code run faster and it usually makes the executable bigger.
b) Do I have to use any pragmas inside my code to take advantage of loop unrolling or are loops identified automatically?
No pragmas. With -funroll-loops the compiler heuristically decides which loops to unroll. If you want to force unrolling you can use -funroll-all-loops, but it usually makes the code run slower.
c) Why is this option not enabled by default if the unrolling increases the performance?
It doesn't always increase performance! Also, not everything is about performance. Some people actually care about having small executables, since they have little memory (see: embedded systems).
d) What are the recommended gcc optimization flags to compile the program in the best way possible? (I must run this program optimized for a single CPU family, the same as the machine where I compile the code; currently I use the -march=native and -O2 flags.)
There's no silver bullet. You'll need to think, test and see. There is actually a theorem that states that no perfect compiler can ever exist.
Did you profile your program? Profiling is a very useful skill for these things.
Source (mostly): https://gcc.gnu.org/onlinedocs/gcc-3.4.4/gcc/Optimize-Options.html