GCC 5.1 Loop unrolling

Tags:

Given the following code

#include <stdio.h>

int main(int argc, char **argv)
{
  int k = 0;
  for( k = 0; k < 20; ++k )
  {
    printf( "%d\n", k ) ;
  }
}

Using GCC 5.1 or later with

-x c -std=c99 -O3 -funroll-all-loops --param max-completely-peeled-insns=1000 --param max-completely-peel-times=10000

does partially loop unrolling, it unrolls the loop ten times and then does a conditional jump.

.LC0:
        .string "%d\n"
main:
        pushq   %rbx
        xorl    %ebx, %ebx
.L2:
        movl    %ebx, %esi
        movl    $.LC0, %edi
        xorl    %eax, %eax
        call    printf
        leal    1(%rbx), %esi
        movl    $.LC0, %edi
        xorl    %eax, %eax
        call    printf
        leal    2(%rbx), %esi
        movl    $.LC0, %edi
        xorl    %eax, %eax
        call    printf
        leal    3(%rbx), %esi
        movl    $.LC0, %edi
        xorl    %eax, %eax
        call    printf
        leal    4(%rbx), %esi
        movl    $.LC0, %edi
        xorl    %eax, %eax
        call    printf
        leal    5(%rbx), %esi
        movl    $.LC0, %edi
        xorl    %eax, %eax
        call    printf
        leal    6(%rbx), %esi
        movl    $.LC0, %edi
        xorl    %eax, %eax
        call    printf
        leal    7(%rbx), %esi
        movl    $.LC0, %edi
        xorl    %eax, %eax
        call    printf
        leal    8(%rbx), %esi
        movl    $.LC0, %edi
        xorl    %eax, %eax
        call    printf
        leal    9(%rbx), %esi
        xorl    %eax, %eax
        movl    $.LC0, %edi
        addl    $10, %ebx
        call    printf
        cmpl    $20, %ebx
        jne     .L2
        xorl    %eax, %eax
        popq    %rbx
        ret

But using older versions of GCC such as 4.9.2 creates the desired assemlby

.LC0:
    .string "%d\n"
main:
    subq    $8, %rsp
    xorl    %edx, %edx
    movl    $.LC0, %esi
    movl    $1, %edi
    xorl    %eax, %eax
    call    __printf_chk
    movl    $1, %edx
    movl    $.LC0, %esi
    movl    $1, %edi
    xorl    %eax, %eax
    call    __printf_chk
    movl    $2, %edx
    movl    $.LC0, %esi
    movl    $1, %edi
    xorl    %eax, %eax
    call    __printf_chk
    movl    $3, %edx
    movl    $.LC0, %esi
    movl    $1, %edi
    xorl    %eax, %eax
    call    __printf_chk
    movl    $4, %edx
    movl    $.LC0, %esi
    movl    $1, %edi
    xorl    %eax, %eax
    call    __printf_chk
    movl    $5, %edx
    movl    $.LC0, %esi
    movl    $1, %edi
    xorl    %eax, %eax
    call    __printf_chk
    movl    $6, %edx
    movl    $.LC0, %esi
    movl    $1, %edi
    xorl    %eax, %eax
    call    __printf_chk
    movl    $7, %edx
    movl    $.LC0, %esi
    movl    $1, %edi
    xorl    %eax, %eax
    call    __printf_chk
    movl    $8, %edx
    movl    $.LC0, %esi
    movl    $1, %edi
    xorl    %eax, %eax
    call    __printf_chk
    movl    $9, %edx
    movl    $.LC0, %esi
    movl    $1, %edi
    xorl    %eax, %eax
    call    __printf_chk
    movl    $10, %edx
    movl    $.LC0, %esi
    movl    $1, %edi
    xorl    %eax, %eax
    call    __printf_chk
    movl    $11, %edx
    movl    $.LC0, %esi
    movl    $1, %edi
    xorl    %eax, %eax
    call    __printf_chk
    movl    $12, %edx
    movl    $.LC0, %esi
    movl    $1, %edi
    xorl    %eax, %eax
    call    __printf_chk
    movl    $13, %edx
    movl    $.LC0, %esi
    movl    $1, %edi
    xorl    %eax, %eax
    call    __printf_chk
    movl    $14, %edx
    movl    $.LC0, %esi
    movl    $1, %edi
    xorl    %eax, %eax
    call    __printf_chk
    movl    $15, %edx
    movl    $.LC0, %esi
    movl    $1, %edi
    xorl    %eax, %eax
    call    __printf_chk
    movl    $16, %edx
    movl    $.LC0, %esi
    movl    $1, %edi
    xorl    %eax, %eax
    call    __printf_chk
    movl    $17, %edx
    movl    $.LC0, %esi
    movl    $1, %edi
    xorl    %eax, %eax
    call    __printf_chk
    movl    $18, %edx
    movl    $.LC0, %esi
    movl    $1, %edi
    xorl    %eax, %eax
    call    __printf_chk
    movl    $19, %edx
    movl    $.LC0, %esi
    movl    $1, %edi
    xorl    %eax, %eax
    call    __printf_chk
    xorl    %eax, %eax
    addq    $8, %rsp
    ret

It there a way to force the later versions of GCC to produce the same output?

Using https://godbolt.org/g/D1AR6i to produce the assembly

EDIT: No duplicated question, since the problem to completly unroll loops with later versions of GCC has not yet been solved. Passing --param max-completely-peeled-insns=1000 --param max-completely-peel-times=10000 has not effects on the generated assembly using GCC >= 5.1

430

asked Jun 22 '16 12:06

surrz

1 Answers

The flags and parameters you are using do not guarantee that the loops will be completely unrolled. The GCC documentation states the following regarding the -funroll-all-loops flag you are using:

turns on complete loop peeling (i.e. complete removal of loops with a small constant number of iterations)

If the compiler decides that the number of iterations for a given piece of code is not "a small constant" (i.e. number is too high), it may only do partial peeling or unrolling as it has done here. Furthermore, the param options you are using are only maximum values, but do not force complete unrolling for loops smaller than the set value. In other words, if a loop has more iterations than the maximum you have set, then the loop will not to be completely unrolled; but the inverse is not true.

Many factors are taken into account when doing optimisations. Here the bottleneck in your code is the call to printf function, and the compiler will probably take this into account when doing its cost calculations, or judge that the instruction size overhead for unrolling is too important. As you are nevertheless telling it to unroll loops, it seems to determine that the best solution is to transform the initial loop with 10 unrolls and a a jump.

If you replace printf with something else, the compiler may optimise differently. For instance try replacing it with the following:

volatile int temp = k;

The loop with this new code snippet will be fully unrolled on the newer versions of GCC (and the older ones as well). Note that the volatile keyword is just a trick used so the compiler does not optimise out the loop completely.

To sum up, to the best of my knowledge there is no way to force later versions of GCC to produce the same output.

As a side note, from optimisation level -O2 onwards and without any additional compiler flags, recent versions of Clang fully unroll your loop.

169

answered Sep 20 '22 16:09

Pyves

Related questions
                            
                                Casting pointer types on different architectures
                            
                                Prevent kernel from processing TCP segments bound to a raw socket
                            
                                Ambiguity in 2d array declaration in C
                            
                                No warning when returning NULL with gcc
                            
                                How to mimic a "use_crc" flag on (hacked) 1998 uncompress() in 2013 zlib API interface?
                            
                                Compiling C from PHP with exec error trying to exec 'cc1'
                            
                                sched_getcpu() equivalent for OS X?
                            
                                How to lift a function to take an extra parameter in C?
                            
                                fast integer square root approximation
                            
                                How are chained macros resolved in C?
                            
                                Understanding of C programming syntax [duplicate]
                            
                                Sending hex code
                            
                                Force read of volatile variable
                            
                                C produce error if no argument is given in command line
                            
                                Why in vulkan.h non dispatchable objects are always typedefed to 64bit?
                            
                                How do I reduce input lag in an NCurses C Application
                            
                                MPI Send and Recv Hangs with Buffer Size Larger Than 64kb
                            
                                C preprocessing fails to stop immediately after an #error
                            
                                call a vararg function with an array?
                            
                                Segmentation fault when using regexec/strtok_r in C

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

GCC 5.1 Loop unrolling

Tags:

c

loops

compiler-optimization

gcc

loop-unrolling

surrz

People also ask

1 Answers

Pyves

Recent Activity

Donate For Us