Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimize nested if statements within a loop in C/C++ with GCC

I am testing various optimizations in C/C++ using the GCC compiler. I currently have a loop with multiple nested if statements. The conditions are computed at the beginning of the execution of the program. It looks somewhat like this:

bool conditionA = getA();
bool conditionB = getB();
bool conditionC = getC();
//Etc.

startTiming();

do {
    if(conditionA) {
        doATrueStuff();
        if(conditionB) {
            //Etc.
        } else {
            //Etc.
        }
    } else {
        doAFalseStuff();
        if(conditionB) {
            //Etc.
        } else {
            //Etc.
        }
    }
} while (testCondition());

endTiming();

Where doATrueStuff() is an inline function that does some simple numerical computation so there is no overhead in calling it.

Unfortunately, the conditions cannot be defined beforehand, they have to be computed during runtime. We can't even reliably predict the chance of them being true or wrong. getA() might as well be rand()%2. But once computed, their value never changes.

There are two solutions I've thought of, one being global function pointers that are used to call the appropriate function within the loop, like this:

void (*ptrA)(void);
//Etc.

int main(int argc, char **argv) {
    //...
    if (conditionA) {
        ptrA=&aTrueFunc;
    } else {
        ptrA=&aFalseFunc;
    }
    //...
    do {
        (*ptrA)();
    } while (testCondition());
    //...
}

That way I can eliminate all branches from the loop, however then I will have the overhead of multiple function calls slowing me down.

Or I could simply have a different loop for each combination of conditions, something like this:

if(conditionA) {
    if(conditionB) {
        do {
            //Do A == true B == true stuff
        } while (testCondition());
    } else {
        do {
            //Do A == true B == false stuff
        } while (testCondition());
    }
} else {
    //Etc.
}

However that is a lot less elegant and gets impossible for one to do so efficiently once one starts having too many conditions, since for X conditions one needs to write 2^X loops.

Is there a more elegant/faster way to optimize this?

Is there even any point in this or will the compiler somehow understand that the condition doesn't change during the loop and optimize it itself?

And out of curiosity, is there another programming language that would make writing such code easier/possible? Or would that only be possible by using assembly to change the instructions of the program once its loaded into memory?

like image 273
Parisbre56 Avatar asked May 15 '15 16:05

Parisbre56


3 Answers

The Theory:

Trying to optimize your code through some wacky rewriting might make it difficult for the compiler to make its usual optimizations. The compiler and also the processor can optimize the code using 2 techniques:

  1. Branch prediction: The compiler can do that by using profile guided optimizations, mainly by estimating the probability of each branch. The CPU has also branch target buffers that try to detect the branching pattern, in addition to calculating statistics for each target.
  2. Branch predication: The compiler or CPU will make the code execute both branches in parallel (because nowadays processors are superscalar) and based on the condition result, it will just disregard the results of the incorrect path (e.g. CMOV instruction). You can try to disable branch predication using: -fno-if-conversion and -fno-if-conversion2. This might help if there is much computation on each branch and executing all paths will lead to a waste of instruction decoders and execution ports.

As a simple developer, using gcc, you can also help branch prediction or code generation using the "likely" and "unlikely" compilation hints. Check here for more details. This might work if you know for example that one condition is more likely to happen than another.

To see the branch prediction efficiency, use perf stat ./binary and check out the branch miss ratio, and the number of branch misses for each optimization you do.

In your code case:

If conditionA, conditionB and conditionC are computed before the loop, and do not change, then it is easy for the branch predictor to detect the pattern. The CPU's predictor does that by keeping track of the last branches taken/not taken and it will use the recorded history to predict the following branches. So I actually expect very little performance penalty due to branches in your code, which you can verify as above.

like image 178
VAndrei Avatar answered Oct 13 '22 06:10

VAndrei


Consider templates. The challenge is in mapping runtime values to compile-time template parameters. The boilerplate below is one dispatch function per parameter, and the compiler will create the tree of combinations for you. Not exactly elegant, but scales much better than open-coding a multi-parameter switchyard.

You can also use the template parameters (or functions of them) directly in your computations, and those will be optimized out as well, for example choosing a constant based on a template parameter, or multiplying a 0 into an expression term that you don't want to contribute.

template <bool B0, bool B1, bool B2>
void doStuffStage3()
{
    // Once you get here, you can use B0, B1, and B2 in
    // any expressions you want, in the inner loop, and the compiler
    // will optimize everything out since they're known compile-time.  Basically,
    // the compiler will create separate versions of this function
    // for all required combinations of the input
    do {
        if(B0) {

        } else {

        }
    } while(testCondition());
}

template <bool B0, bool B1>
void doStuffStage2(bool b2)
{
    if(b2) doStuffStage3<B0,B1,true>();
    else   doStuffStage3<B0,B1,false>();
}

template <bool B0>
void doStuffStage1(bool b1, bool b2)
{
    if(b1) doStuffStage2<B0,true> (b2);
    else   doStuffStage2<B0,false>(b2);
}

void doStuff(bool b0, bool b1, bool b2)
{
    if(b0) doStuffStage1<true> (b1, b2);
    else   doStuffStage1<false>(b1, b2);
}

int main()
{
    doStuff(getA(), getB(), getC());
}
like image 43
Peter Avatar answered Oct 13 '22 08:10

Peter


Quick update of 2019.

If performance is concerned, you want your assembly code to be written with the "if" OUTSIDE your for loop. Effect of an "if statement" inside a loop could be important, even with the best branch prediction. CPU will execute 2 more instructions (a "cmp" and a "jump") on each loop. Suppose you are working with large images, and your loop walks through all the pixels of the image, this can become a lot of cpu cycles.

However, if you write the code the way you do it (first code you show), an optimized (-03) gcc will actually put the conditions outside your loop and copy the almost same code in each branch to prevent having an inefficient if inside your loop. Basically gcc is smart enough to write the output of your third code when you lazily write the first :-). At least with two conditions. I did not make the exercise with more than 2 conditions.

This behaviour is actually called loop unswitching : https://en.wikipedia.org/wiki/Loop_unswitching

// Disassemblies can be generated with
//  gcc -DLAZY_WRITING -O3 -c -S main.c -o lazy.s
//  gcc -O3 -c -S main.c -o notlazy.s
// -O3 is important as otherwise the condition appears in the loop
#ifdef LAZY_WRITING /* gcc will optimize*/
int do_that_big_loops()
{
    int i;
    int condition1 = get_condition1();
    int condition2 = get_condition2();
    int len = 10000;
    for (i =0; i<len+1; i++)
    {
        call_my_func_always(i);
        if (condition1)
        {
            if (condition2)
                call_my_func_c1_c2(i);
            else
                call_my_func_c1_nc2(i);
        }
        else
        {
            if (condition2)
            {
                call_my_func_nc1_c2(i);
            }
            else
            {
                call_my_func_nc1_nc2(i);
            }
        }
    }
    return 0;
}
#else /* human-optimization */
int do_that_big_loops()
{
    int i;
    int condition1 = get_condition1();
    int condition2 = get_condition2();
    int len = 10000;
    if (condition1 && condition2)
    {
        for (i =0; i<len+1; i++)
        {
            call_my_func_always(i);
            call_my_func_c1_c2(i);
        }
    }
    else if (condition1 && !condition2)
    {
        for (i =0; i<len+1; i++)
        {
            call_my_func_always(i);
            call_my_func_c1_nc2(i);
        }
    }
    else if (!condition1 && condition2)
    {
        for (i =0; i<len+1; i++)
        {
            call_my_func_always(i);
            call_my_func_nc1_c2(i);
        }
    }
    else // (!condition1 && !condition2)
    {
        for (i =0; i<len+1; i++)
        {
            call_my_func_always(i);
            call_my_func_nc1_nc2(i);
        }
    }
    return 0;
}
#endif

Below is the disassembly of the lazy version. It is almost be the same as the non-lazy one (not included in the post, feel free to generate it with provided gcc commands). You will see 4 different calls to call_my_func_always() although only one is actually written in the code.

    .file   "main.c"
    .section    .text.unlikely,"ax",@progbits
.LCOLDB0:
    .text
.LHOTB0:
    .p2align 4,,15
    .globl  do_that_big_loops
    .type   do_that_big_loops, @function
do_that_big_loops:
.LFB0:
    .cfi_startproc
    pushq   %rbx
    .cfi_def_cfa_offset 16
    .cfi_offset 3, -16
    xorl    %eax, %eax
    call    get_condition1
    movl    %eax, %ebx
    xorl    %eax, %eax
    call    get_condition2
    testl   %ebx, %ebx
    jne .L2
    testl   %eax, %eax
    je  .L4
    xorl    %ebx, %ebx
    .p2align 4,,10
    .p2align 3
.L6:
    movl    %ebx, %edi
    xorl    %eax, %eax
    call    call_my_func_always
    movl    %ebx, %edi
    xorl    %eax, %eax
    addl    $1, %ebx
    call    call_my_func_nc1_c2
    cmpl    $10001, %ebx
    jne .L6
.L5:
    xorl    %eax, %eax
    popq    %rbx
    .cfi_remember_state
    .cfi_def_cfa_offset 8
    ret
    .p2align 4,,10
    .p2align 3
.L4:
    .cfi_restore_state
    movl    %ebx, %edi
    xorl    %eax, %eax
    call    call_my_func_always
    movl    %ebx, %edi
    xorl    %eax, %eax
    addl    $1, %ebx
    call    call_my_func_nc1_nc2
    cmpl    $10001, %ebx
    jne .L4
    jmp .L5
    .p2align 4,,10
    .p2align 3
.L2:
    xorl    %ebx, %ebx
    testl   %eax, %eax
    jne .L9
    .p2align 4,,10
    .p2align 3
.L8:
    movl    %ebx, %edi
    xorl    %eax, %eax
    call    call_my_func_always
    movl    %ebx, %edi
    xorl    %eax, %eax
    addl    $1, %ebx
    call    call_my_func_c1_nc2
    cmpl    $10001, %ebx
    jne .L8
    jmp .L5
    .p2align 4,,10
    .p2align 3
.L9:
    movl    %ebx, %edi
    xorl    %eax, %eax
    call    call_my_func_always
    movl    %ebx, %edi
    xorl    %eax, %eax
    addl    $1, %ebx
    call    call_my_func_c1_c2
    cmpl    $10001, %ebx
    jne .L9
    jmp .L5
    .cfi_endproc
.LFE0:
    .size   do_that_big_loops, .-do_that_big_loops
    .section    .text.unlikely
.LCOLDE0:
    .text
.LHOTE0:
    .ident  "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609"
    .section    .note.GNU-stack,"",@progbits
like image 29
Charles B Avatar answered Oct 13 '22 06:10

Charles B