Does gcc automatically "unroll" if-statements?

Say I have a loop that looks like this:

for(int i = 0; i < 10000; i++) {
    /* Do something computationally expensive */
    if (i < 200 && !(i%20)) {
        /* Do something else */
    }
}

wherein some trivial task gets stuck behind an if-statement that only runs a handful of times. I've always heard that "if-statements in loops are slow!" So, in the hopes of (marginally) increased performance, I split the loop apart into:

for(int i = 0; i < 200; i++) {
    /* Do something computationally expensive */
    if (!(i%20)) {
        /* Do something else */
    }
}

for(int i = 200; i < 10000; i++) {
    /* Do something computationally expensive */
}

Will gcc (with the appropriate flags, like -O3) automatically break the one loop into two, or does it only unroll to decrease the number of iterations?

asked Feb 16 '11 by Connor Glosser

People also ask

How effective is loop unrolling?

Since loop unrolling is a tradeoff between code size and speed, the effectiveness of loop unrolling is highly dependent on the loop unrolling factor, that is, the number of times the loop is unrolled. As the factor increases, code size increases, leading to potential issues in the front end, such as icache misses. A short manual-unrolling sketch follows after these questions.

What does -funroll-loops do?

With -funroll-loops the compiler heuristically decides which loops to unroll. If you want to force unrolling you can use -funroll-all-loops, but it usually makes the code run slower.
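As a rough illustration of that tradeoff, here is a minimal hand-unrolled sketch (my own example, not from the question): the loop body is replicated four times, so there are fewer counter updates and branches per element, at the price of a bigger body plus a cleanup loop. Passes like -funroll-loops perform essentially this transformation automatically.

#include <stddef.h>

/* Straightforward version: one add, one compare, one branch per element. */
long sum_simple(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by a factor of 4: four adds per compare/branch, at the cost of
   a larger body and a remainder loop for n not divisible by 4. */
long sum_unrolled(const int *a, size_t n) {
    long s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    for (; i < n; i++)
        s += a[i];
    return s;
}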


1 Answer

Why not just disassemble the program and see for yourself? But here we go. This is the test program:

#include <stdio.h>

int main() {
    int sum = 0;
    int i;
    for(i = 0; i < 10000; i++) {
        if (i < 200 && !(i%20)) {
            sum += 0xC0DE;
        }
        sum += 0xCAFE;
    }
    printf("%d\n", sum);
    return 0;
}

and this is the interesting part of the disassembled code compiled with gcc 4.3.3 and -O3:

0x08048404 <main+20>:   xor    ebx,ebx
0x08048406 <main+22>:   push   ecx
0x08048407 <main+23>:   xor    ecx,ecx
0x08048409 <main+25>:   sub    esp,0xc
0x0804840c <main+28>:   lea    esi,[esi+eiz*1+0x0]
0x08048410 <main+32>:   cmp    ecx,0xc7
0x08048416 <main+38>:   jg     0x8048436 <main+70>
0x08048418 <main+40>:   mov    eax,ecx
0x0804841a <main+42>:   imul   esi
0x0804841c <main+44>:   mov    eax,ecx
0x0804841e <main+46>:   sar    eax,0x1f
0x08048421 <main+49>:   sar    edx,0x3
0x08048424 <main+52>:   sub    edx,eax
0x08048426 <main+54>:   lea    edx,[edx+edx*4]
0x08048429 <main+57>:   shl    edx,0x2
0x0804842c <main+60>:   cmp    ecx,edx
0x0804842e <main+62>:   jne    0x8048436 <main+70>
0x08048430 <main+64>:   add    ebx,0xc0de
0x08048436 <main+70>:   add    ecx,0x1
0x08048439 <main+73>:   add    ebx,0xcafe
0x0804843f <main+79>:   cmp    ecx,0x2710
0x08048445 <main+85>:   jne    0x8048410 <main+32>
0x08048447 <main+87>:   mov    DWORD PTR [esp+0x8],ebx
0x0804844b <main+91>:   mov    DWORD PTR [esp+0x4],0x8048530
0x08048453 <main+99>:   mov    DWORD PTR [esp],0x1
0x0804845a <main+106>:  call   0x8048308 <__printf_chk@plt>

So as we see, for this particular example: no, it does not. There is only one loop, starting at main+32 and ending at main+85. If you have trouble reading the assembly, ecx = i and ebx = sum.
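For the curious, the imul/sar/sub/lea/shl sequence in the middle of the loop is how the compiler implements i % 20 without a division: it tests i == (i / 20) * 20, computing i / 20 by multiplying with a reciprocal constant. A minimal C sketch of that trick, assuming the usual signed magic number 0x66666667 for division by 20 (the register that holds it is set up outside the snippet shown, so the exact constant is an assumption here):

#include <assert.h>
#include <stdint.h>

/* Reciprocal-multiplication sketch of the i % 20 test above. Assumes
   arithmetic right shift for signed values, as gcc provides. This is an
   illustration of the technique, not the literal code gcc emits. */
static int divisible_by_20(int32_t i) {
    int64_t prod = (int64_t)i * 0x66666667;  /* imul: full 64-bit product        */
    int32_t q = (int32_t)(prod >> 35);       /* high half >> 3: sar edx,0x3      */
    q -= i >> 31;                            /* sign correction: sub edx,eax     */
    int32_t back = (q * 5) * 4;              /* q * 20: lea edx,[edx+edx*4]; shl */
    return i == back;                        /* cmp ecx,edx / jne                */
}

int main(void) {
    for (int32_t i = -100; i < 10000; i++)
        assert(divisible_by_20(i) == (i % 20 == 0));
    return 0;
}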

But still, your mileage may vary: who knows what heuristics are used for this particular case, so you'll have to compile the code you have in mind and see how longer or more complicated computations influence the optimizer.

On any modern CPU, though, the branch predictor will do pretty well on such easy code, so you won't see much performance loss in either case. What is the cost of maybe a handful of mispredictions if your computation-intensive code needs billions of cycles?
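If you want numbers rather than assembly, a rough timing sketch like the one below compares the single-loop and split-loop variants directly. work() and extra() are placeholder functions standing in for the question's "computationally expensive" parts, so the absolute figures mean little; the point is that the two variants end up within noise of each other once the real work dominates.

#include <stdio.h>
#include <time.h>

static volatile long sink;                    /* keeps the work from being optimized away */
static void work(int i)  { sink += i; }       /* placeholder for the expensive part       */
static void extra(int i) { sink += 3 * i; }   /* placeholder for the rare branch's body   */

static void single_loop(void) {
    for (int i = 0; i < 10000; i++) {
        work(i);
        if (i < 200 && !(i % 20))
            extra(i);
    }
}

static void split_loops(void) {
    for (int i = 0; i < 200; i++) {
        work(i);
        if (!(i % 20))
            extra(i);
    }
    for (int i = 200; i < 10000; i++)
        work(i);
}

int main(void) {
    clock_t t0 = clock();
    for (int r = 0; r < 100000; r++) single_loop();
    clock_t t1 = clock();
    for (int r = 0; r < 100000; r++) split_loops();
    clock_t t2 = clock();
    printf("single: %.3fs  split: %.3fs\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}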

answered Oct 23 '22 by Voo