Given the following code
#include <stdio.h>
int main(int argc, char **argv)
{
int k = 0;
for( k = 0; k < 20; ++k )
{
printf( "%d\n", k ) ;
}
}
Using GCC 5.1 or later with
-x c -std=c99 -O3 -funroll-all-loops --param max-completely-peeled-insns=1000 --param max-completely-peel-times=10000
does partially loop unrolling, it unrolls the loop ten times and then does a conditional jump.
.LC0:
.string "%d\n"
main:
pushq %rbx
xorl %ebx, %ebx
.L2:
movl %ebx, %esi
movl $.LC0, %edi
xorl %eax, %eax
call printf
leal 1(%rbx), %esi
movl $.LC0, %edi
xorl %eax, %eax
call printf
leal 2(%rbx), %esi
movl $.LC0, %edi
xorl %eax, %eax
call printf
leal 3(%rbx), %esi
movl $.LC0, %edi
xorl %eax, %eax
call printf
leal 4(%rbx), %esi
movl $.LC0, %edi
xorl %eax, %eax
call printf
leal 5(%rbx), %esi
movl $.LC0, %edi
xorl %eax, %eax
call printf
leal 6(%rbx), %esi
movl $.LC0, %edi
xorl %eax, %eax
call printf
leal 7(%rbx), %esi
movl $.LC0, %edi
xorl %eax, %eax
call printf
leal 8(%rbx), %esi
movl $.LC0, %edi
xorl %eax, %eax
call printf
leal 9(%rbx), %esi
xorl %eax, %eax
movl $.LC0, %edi
addl $10, %ebx
call printf
cmpl $20, %ebx
jne .L2
xorl %eax, %eax
popq %rbx
ret
But using older versions of GCC such as 4.9.2 creates the desired assemlby
.LC0:
.string "%d\n"
main:
subq $8, %rsp
xorl %edx, %edx
movl $.LC0, %esi
movl $1, %edi
xorl %eax, %eax
call __printf_chk
movl $1, %edx
movl $.LC0, %esi
movl $1, %edi
xorl %eax, %eax
call __printf_chk
movl $2, %edx
movl $.LC0, %esi
movl $1, %edi
xorl %eax, %eax
call __printf_chk
movl $3, %edx
movl $.LC0, %esi
movl $1, %edi
xorl %eax, %eax
call __printf_chk
movl $4, %edx
movl $.LC0, %esi
movl $1, %edi
xorl %eax, %eax
call __printf_chk
movl $5, %edx
movl $.LC0, %esi
movl $1, %edi
xorl %eax, %eax
call __printf_chk
movl $6, %edx
movl $.LC0, %esi
movl $1, %edi
xorl %eax, %eax
call __printf_chk
movl $7, %edx
movl $.LC0, %esi
movl $1, %edi
xorl %eax, %eax
call __printf_chk
movl $8, %edx
movl $.LC0, %esi
movl $1, %edi
xorl %eax, %eax
call __printf_chk
movl $9, %edx
movl $.LC0, %esi
movl $1, %edi
xorl %eax, %eax
call __printf_chk
movl $10, %edx
movl $.LC0, %esi
movl $1, %edi
xorl %eax, %eax
call __printf_chk
movl $11, %edx
movl $.LC0, %esi
movl $1, %edi
xorl %eax, %eax
call __printf_chk
movl $12, %edx
movl $.LC0, %esi
movl $1, %edi
xorl %eax, %eax
call __printf_chk
movl $13, %edx
movl $.LC0, %esi
movl $1, %edi
xorl %eax, %eax
call __printf_chk
movl $14, %edx
movl $.LC0, %esi
movl $1, %edi
xorl %eax, %eax
call __printf_chk
movl $15, %edx
movl $.LC0, %esi
movl $1, %edi
xorl %eax, %eax
call __printf_chk
movl $16, %edx
movl $.LC0, %esi
movl $1, %edi
xorl %eax, %eax
call __printf_chk
movl $17, %edx
movl $.LC0, %esi
movl $1, %edi
xorl %eax, %eax
call __printf_chk
movl $18, %edx
movl $.LC0, %esi
movl $1, %edi
xorl %eax, %eax
call __printf_chk
movl $19, %edx
movl $.LC0, %esi
movl $1, %edi
xorl %eax, %eax
call __printf_chk
xorl %eax, %eax
addq $8, %rsp
ret
It there a way to force the later versions of GCC to produce the same output?
Using https://godbolt.org/g/D1AR6i to produce the assembly
EDIT: No duplicated question, since the problem to completly unroll loops with later versions of GCC has not yet been solved. Passing --param max-completely-peeled-insns=1000 --param max-completely-peel-times=10000
has not effects on the generated assembly using GCC >= 5.1
I have found the gcc flag -funroll-all-loops . If I understand correctly, this will unroll all loops automatically without any efforts by the programmer.
4.1) Basic Compiler Techniques for Exposing ILP (page 304) Unrolling simply replicates the loop body multiple times, adjusting the loop termination code. Loop unrolling can also be used to improve scheduling. Because it eliminates the branch, it allows instructions from different iterations to be scheduled together.
Loop unrolling is a loop transformation technique that helps to optimize the execution time of a program. We basically remove or reduce iterations. Loop unrolling increases the program's speed by eliminating loop control instruction and loop test instructions.
Improved floating-point performance - loop unrolling can improve performance by providing the compiler more instructions to schedule across the unrolled iterations. This reduces the number of NOPs generated and also provides the compiler with a greater opportunity to generate parallel instructions.
The flags and parameters you are using do not guarantee that the loops will be completely unrolled. The GCC documentation states the following regarding the -funroll-all-loops
flag you are using:
turns on complete loop peeling (i.e. complete removal of loops with a small constant number of iterations)
If the compiler decides that the number of iterations for a given piece of code is not "a small constant" (i.e. number is too high), it may only do partial peeling or unrolling as it has done here. Furthermore, the param
options you are using are only maximum values, but do not force complete unrolling for loops smaller than the set value. In other words, if a loop has more iterations than the maximum you have set, then the loop will not to be completely unrolled; but the inverse is not true.
Many factors are taken into account when doing optimisations. Here the bottleneck in your code is the call to printf
function, and the compiler will probably take this into account when doing its cost calculations, or judge that the instruction size overhead for unrolling is too important. As you are nevertheless telling it to unroll loops, it seems to determine that the best solution is to transform the initial loop with 10 unrolls and a a jump.
If you replace printf
with something else, the compiler may optimise differently. For instance try replacing it with the following:
volatile int temp = k;
The loop with this new code snippet will be fully unrolled on the newer versions of GCC (and the older ones as well). Note that the volatile keyword is just a trick used so the compiler does not optimise out the loop completely.
To sum up, to the best of my knowledge there is no way to force later versions of GCC to produce the same output.
As a side note, from optimisation level -O2
onwards and without any additional compiler flags, recent versions of Clang fully unroll your loop.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With