I am testing various optimizations in C/C++ using the GCC compiler. I currently have a loop with multiple nested if statements. The conditions are computed at the beginning of the execution of the program. It looks somewhat like this:
bool conditionA = getA();
bool conditionB = getB();
bool conditionC = getC();
//Etc.
startTiming();
do {
if(conditionA) {
doATrueStuff();
if(conditionB) {
//Etc.
} else {
//Etc.
}
} else {
doAFalseStuff();
if(conditionB) {
//Etc.
} else {
//Etc.
}
}
} while (testCondition());
endTiming();
Where doATrueStuff()
is an inline function that does some simple numerical computation so there is no overhead in calling it.
Unfortunately, the conditions cannot be defined beforehand, they have to be computed during runtime. We can't even reliably predict the chance of them being true or wrong. getA()
might as well be rand()%2
. But once computed, their value never changes.
There are two solutions I've thought of, one being global function pointers that are used to call the appropriate function within the loop, like this:
void (*ptrA)(void);
//Etc.
int main(int argc, char **argv) {
//...
if (conditionA) {
ptrA=&aTrueFunc;
} else {
ptrA=&aFalseFunc;
}
//...
do {
(*ptrA)();
} while (testCondition());
//...
}
That way I can eliminate all branches from the loop, however then I will have the overhead of multiple function calls slowing me down.
Or I could simply have a different loop for each combination of conditions, something like this:
if(conditionA) {
if(conditionB) {
do {
//Do A == true B == true stuff
} while (testCondition());
} else {
do {
//Do A == true B == false stuff
} while (testCondition());
}
} else {
//Etc.
}
However that is a lot less elegant and gets impossible for one to do so efficiently once one starts having too many conditions, since for X conditions one needs to write 2^X loops.
Is there a more elegant/faster way to optimize this?
Is there even any point in this or will the compiler somehow understand that the condition doesn't change during the loop and optimize it itself?
And out of curiosity, is there another programming language that would make writing such code easier/possible? Or would that only be possible by using assembly to change the instructions of the program once its loaded into memory?
The Theory:
Trying to optimize your code through some wacky rewriting might make it difficult for the compiler to make its usual optimizations. The compiler and also the processor can optimize the code using 2 techniques:
As a simple developer, using gcc, you can also help branch prediction or code generation using the "likely" and "unlikely" compilation hints. Check here for more details. This might work if you know for example that one condition is more likely to happen than another.
To see the branch prediction efficiency, use perf stat ./binary and check out the branch miss ratio, and the number of branch misses for each optimization you do.
In your code case:
If conditionA, conditionB and conditionC are computed before the loop, and do not change, then it is easy for the branch predictor to detect the pattern. The CPU's predictor does that by keeping track of the last branches taken/not taken and it will use the recorded history to predict the following branches. So I actually expect very little performance penalty due to branches in your code, which you can verify as above.
Consider templates. The challenge is in mapping runtime values to compile-time template parameters. The boilerplate below is one dispatch function per parameter, and the compiler will create the tree of combinations for you. Not exactly elegant, but scales much better than open-coding a multi-parameter switchyard.
You can also use the template parameters (or functions of them) directly in your computations, and those will be optimized out as well, for example choosing a constant based on a template parameter, or multiplying a 0 into an expression term that you don't want to contribute.
template <bool B0, bool B1, bool B2>
void doStuffStage3()
{
// Once you get here, you can use B0, B1, and B2 in
// any expressions you want, in the inner loop, and the compiler
// will optimize everything out since they're known compile-time. Basically,
// the compiler will create separate versions of this function
// for all required combinations of the input
do {
if(B0) {
} else {
}
} while(testCondition());
}
template <bool B0, bool B1>
void doStuffStage2(bool b2)
{
if(b2) doStuffStage3<B0,B1,true>();
else doStuffStage3<B0,B1,false>();
}
template <bool B0>
void doStuffStage1(bool b1, bool b2)
{
if(b1) doStuffStage2<B0,true> (b2);
else doStuffStage2<B0,false>(b2);
}
void doStuff(bool b0, bool b1, bool b2)
{
if(b0) doStuffStage1<true> (b1, b2);
else doStuffStage1<false>(b1, b2);
}
int main()
{
doStuff(getA(), getB(), getC());
}
Quick update of 2019.
If performance is concerned, you want your assembly code to be written with the "if" OUTSIDE your for loop. Effect of an "if statement" inside a loop could be important, even with the best branch prediction. CPU will execute 2 more instructions (a "cmp" and a "jump") on each loop. Suppose you are working with large images, and your loop walks through all the pixels of the image, this can become a lot of cpu cycles.
However, if you write the code the way you do it (first code you show), an optimized (-03) gcc will actually put the conditions outside your loop and copy the almost same code in each branch to prevent having an inefficient if inside your loop. Basically gcc is smart enough to write the output of your third code when you lazily write the first :-). At least with two conditions. I did not make the exercise with more than 2 conditions.
This behaviour is actually called loop unswitching : https://en.wikipedia.org/wiki/Loop_unswitching
// Disassemblies can be generated with
// gcc -DLAZY_WRITING -O3 -c -S main.c -o lazy.s
// gcc -O3 -c -S main.c -o notlazy.s
// -O3 is important as otherwise the condition appears in the loop
#ifdef LAZY_WRITING /* gcc will optimize*/
int do_that_big_loops()
{
int i;
int condition1 = get_condition1();
int condition2 = get_condition2();
int len = 10000;
for (i =0; i<len+1; i++)
{
call_my_func_always(i);
if (condition1)
{
if (condition2)
call_my_func_c1_c2(i);
else
call_my_func_c1_nc2(i);
}
else
{
if (condition2)
{
call_my_func_nc1_c2(i);
}
else
{
call_my_func_nc1_nc2(i);
}
}
}
return 0;
}
#else /* human-optimization */
int do_that_big_loops()
{
int i;
int condition1 = get_condition1();
int condition2 = get_condition2();
int len = 10000;
if (condition1 && condition2)
{
for (i =0; i<len+1; i++)
{
call_my_func_always(i);
call_my_func_c1_c2(i);
}
}
else if (condition1 && !condition2)
{
for (i =0; i<len+1; i++)
{
call_my_func_always(i);
call_my_func_c1_nc2(i);
}
}
else if (!condition1 && condition2)
{
for (i =0; i<len+1; i++)
{
call_my_func_always(i);
call_my_func_nc1_c2(i);
}
}
else // (!condition1 && !condition2)
{
for (i =0; i<len+1; i++)
{
call_my_func_always(i);
call_my_func_nc1_nc2(i);
}
}
return 0;
}
#endif
Below is the disassembly of the lazy version. It is almost be the same as the non-lazy one (not included in the post, feel free to generate it with provided gcc commands). You will see 4 different calls to call_my_func_always() although only one is actually written in the code.
.file "main.c"
.section .text.unlikely,"ax",@progbits
.LCOLDB0:
.text
.LHOTB0:
.p2align 4,,15
.globl do_that_big_loops
.type do_that_big_loops, @function
do_that_big_loops:
.LFB0:
.cfi_startproc
pushq %rbx
.cfi_def_cfa_offset 16
.cfi_offset 3, -16
xorl %eax, %eax
call get_condition1
movl %eax, %ebx
xorl %eax, %eax
call get_condition2
testl %ebx, %ebx
jne .L2
testl %eax, %eax
je .L4
xorl %ebx, %ebx
.p2align 4,,10
.p2align 3
.L6:
movl %ebx, %edi
xorl %eax, %eax
call call_my_func_always
movl %ebx, %edi
xorl %eax, %eax
addl $1, %ebx
call call_my_func_nc1_c2
cmpl $10001, %ebx
jne .L6
.L5:
xorl %eax, %eax
popq %rbx
.cfi_remember_state
.cfi_def_cfa_offset 8
ret
.p2align 4,,10
.p2align 3
.L4:
.cfi_restore_state
movl %ebx, %edi
xorl %eax, %eax
call call_my_func_always
movl %ebx, %edi
xorl %eax, %eax
addl $1, %ebx
call call_my_func_nc1_nc2
cmpl $10001, %ebx
jne .L4
jmp .L5
.p2align 4,,10
.p2align 3
.L2:
xorl %ebx, %ebx
testl %eax, %eax
jne .L9
.p2align 4,,10
.p2align 3
.L8:
movl %ebx, %edi
xorl %eax, %eax
call call_my_func_always
movl %ebx, %edi
xorl %eax, %eax
addl $1, %ebx
call call_my_func_c1_nc2
cmpl $10001, %ebx
jne .L8
jmp .L5
.p2align 4,,10
.p2align 3
.L9:
movl %ebx, %edi
xorl %eax, %eax
call call_my_func_always
movl %ebx, %edi
xorl %eax, %eax
addl $1, %ebx
call call_my_func_c1_c2
cmpl $10001, %ebx
jne .L9
jmp .L5
.cfi_endproc
.LFE0:
.size do_that_big_loops, .-do_that_big_loops
.section .text.unlikely
.LCOLDE0:
.text
.LHOTE0:
.ident "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609"
.section .note.GNU-stack,"",@progbits
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With