Imagine the assembly code that would be generated from:
if (__builtin_expect(x, 0)) {
foo();
...
} else {
bar();
...
}
I guess it should be something like:
cmp $x, 0
jne _foo
_bar:
call bar
...
jmp after_if
_foo:
call foo
...
after_if:
You can see that the instructions are arranged so that the bar case precedes the foo case (the opposite of the order in the C code). This can utilise the CPU pipeline better, since a taken jump discards the instructions that were already fetched.
Before the jump is executed, the instructions below it (the bar case) are pushed into the pipeline. Since the foo case is unlikely, the jump is unlikely to be taken, and so flushing the pipeline is unlikely too.
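To check a guess like the one above against a real compiler, here is a minimal compilable sketch of the same if/else. The counters, the noinline attribute, and the dispatch wrapper are my own additions for illustration; they are not part of the original snippet.

```c
/* Call counters so the effect of each branch is observable. */
int foo_calls;
int bar_calls;

/* noinline (a GCC/Clang attribute) keeps both calls visible in the assembly. */
__attribute__((noinline)) void foo(void) { foo_calls++; }
__attribute__((noinline)) void bar(void) { bar_calls++; }

/* Hypothetical wrapper around the if/else from the question. */
void dispatch(int x)
{
    if (__builtin_expect(x, 0)) {
        foo();   /* hinted as the unlikely case */
    } else {
        bar();   /* hinted as the likely case */
    }
}
```

Compiling this with gcc -O2 -S and reading the generated .s file shows which call your compiler places on the fall-through path. The exact layout depends on the compiler version and target.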
Let's disassemble to see what GCC 4.8 does with it
Blagovest mentioned branch inversion to improve the pipeline, but do current compilers really do it? Let's find out!
Without __builtin_expect
#include <stdio.h>
#include <time.h>
int main() {
/* Use time to prevent it from being optimized away. */
int i = !time(NULL);
if (i)
puts("a");
return 0;
}
Compile and disassemble with GCC 4.8.2 on x86_64 Linux:
gcc -c -O3 -std=gnu11 main.c
objdump -dr main.o
Output:
0000000000000000 <main>:
0: 48 83 ec 08 sub $0x8,%rsp
4: 31 ff xor %edi,%edi
6: e8 00 00 00 00 callq b <main+0xb>
7: R_X86_64_PC32 time-0x4
b: 48 85 c0 test %rax,%rax
e: 75 0a jne 1a <main+0x1a>
10: bf 00 00 00 00 mov $0x0,%edi
11: R_X86_64_32 .rodata.str1.1
15: e8 00 00 00 00 callq 1a <main+0x1a>
16: R_X86_64_PC32 puts-0x4
1a: 31 c0 xor %eax,%eax
1c: 48 83 c4 08 add $0x8,%rsp
20: c3 retq
The instruction order in memory was unchanged: first the puts and then the retq return.
With __builtin_expect
Now replace if (i) with:
if (__builtin_expect(i, 0))
and we get:
0000000000000000 <main>:
0: 48 83 ec 08 sub $0x8,%rsp
4: 31 ff xor %edi,%edi
6: e8 00 00 00 00 callq b <main+0xb>
7: R_X86_64_PC32 time-0x4
b: 48 85 c0 test %rax,%rax
e: 74 07 je 17 <main+0x17>
10: 31 c0 xor %eax,%eax
12: 48 83 c4 08 add $0x8,%rsp
16: c3 retq
17: bf 00 00 00 00 mov $0x0,%edi
18: R_X86_64_32 .rodata.str1.1
1c: e8 00 00 00 00 callq 21 <main+0x21>
1d: R_X86_64_PC32 puts-0x4
21: eb ed jmp 10 <main+0x10>
The puts call was moved to the very end of the function, after the retq return!
The new code is basically the same as:
int i = !time(NULL);
if (i)
goto puts;
ret:
return 0;
puts:
puts("a");
goto ret;
This optimization was not done with -O0.
But good luck writing an example that runs faster with __builtin_expect than without: CPUs are really smart these days. My naive attempts are here.
C++20 [[likely]] and [[unlikely]]
C++20 has standardized equivalents of these built-ins as attributes; see: How to use C++20's likely/unlikely attribute in if-else statement. They will likely (a pun!) do the same thing.
The idea of __builtin_expect(exp, c) is to tell the compiler that the expression exp will usually evaluate to c, so that the compiler can optimize for that case.
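As a minimal sketch of how the hint is usually wrapped in practice: the likely/unlikely macro names below follow a common convention (the Linux kernel uses similar ones), and sum_bytes is a hypothetical helper invented for this example.

```c
#include <stddef.h>

/* Conventional wrapper macros; !! normalises any non-zero value to 1
   so that the comparison against the expected value is meaningful. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical helper: sums a buffer, assuming a NULL pointer is rare. */
long sum_bytes(const unsigned char *buf, size_t len)
{
    if (unlikely(buf == NULL))
        return -1;              /* cold path: hinted as rarely taken */

    long total = 0;
    for (size_t i = 0; i < len; i++)
        total += buf[i];
    return total;
}
```

The hint only influences code layout and optimization decisions. The function returns the same results with or without it.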
I'd guess that someone thought they were being clever and that they were speeding things up by doing this.
Unfortunately, unless the situation is very well understood (it's likely that they have done no such thing), it may well have made things worse. The documentation even says:
In general, you should prefer to use actual profile feedback for this (-fprofile-arcs), as programmers are notoriously bad at predicting how their programs actually perform. However, there are applications in which this data is hard to collect.
In general, you shouldn't be using __builtin_expect unless:
Well, as it says in the description, the first version adds a predictive element to the construction, telling the compiler that the x == 0 branch is the more likely one - that is, it's the branch that will be taken more often by your program.
With that in mind, the compiler can optimize the conditional so that it requires the least amount of work when the expected condition holds, at the expense of maybe having to do more work in case of the unexpected condition.
Take a look at how conditionals are implemented during the compilation phase, and also in the resulting assembly, to see how one branch may be less work than the other.
However, I would only expect this optimization to have noticeable effect if the conditional in question is part of a tight inner loop that gets called a lot, since the difference in the resulting code is relatively small. And if you optimize it the wrong way round, you may well decrease your performance.
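For illustration, this is the kind of hot loop where the hint might matter. It is only a sketch: count_negative and the assumption that negative samples are rare are invented for this example, and any real benefit would have to be measured.

```c
#include <stddef.h>

/* Hypothetical example: counts negative samples, assuming they are rare. */
size_t count_negative(const int *samples, size_t n)
{
    size_t negatives = 0;
    for (size_t i = 0; i < n; i++) {
        /* Hint that the comparison is usually false. */
        if (__builtin_expect(samples[i] < 0, 0))
            negatives++;
    }
    return negatives;
}
```

If most samples turn out to be negative, the hint points the wrong way, which is the "wrong way round" case mentioned above.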
I don't see any of the answers addressing the question that I think you were asking, paraphrased:
Is there a more portable way of hinting branch prediction to the compiler?
The title of your question made me think of doing it this way:
if ( !x ) {} else foo();
If the compiler assumes that 'true' is more likely, it could optimize for not calling foo().
The problem here is just that you don't, in general, know what the compiler will assume -- so any code that uses this kind of technique would need to be carefully measured (and possibly monitored over time if the context changes).
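A different approach that is often used for portability is to hide the built-in behind a macro that falls back to the plain expression on other compilers. This is a sketch of that technique, not something the answers above propose; the EXPECT_TRUE/EXPECT_FALSE names and the checked_div helper are hypothetical.

```c
/* Use the built-in where it is known to exist (GCC and Clang both define
   __GNUC__); otherwise fall back to the plain condition with no hint. */
#if defined(__GNUC__)
#  define EXPECT_TRUE(x)  __builtin_expect(!!(x), 1)
#  define EXPECT_FALSE(x) __builtin_expect(!!(x), 0)
#else
#  define EXPECT_TRUE(x)  (!!(x))
#  define EXPECT_FALSE(x) (!!(x))
#endif

/* Hypothetical helper: division with a rarely-taken error path.
   Overflow for INT_MIN / -1 is not handled in this sketch. */
int checked_div(int num, int den, int *out)
{
    if (EXPECT_FALSE(den == 0))
        return -1;          /* error: division by zero */
    *out = num / den;
    return 0;
}
```

On compilers without the built-in the code still compiles and behaves identically. It just carries no hint.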