likely(x) and __builtin_expect((x),1)

Tags: c, optimization, macros, code-generation, built-in

I know the Linux kernel uses the likely and unlikely macros extensively. They are documented in the GCC manual under "Built-in Function: long __builtin_expect (long exp, long c)", but the documentation doesn't really discuss the details.

How exactly does a compiler handle likely(x) and __builtin_expect((x),1)?

Is it handled by the code generator or the optimizer?

Does it depend upon optimization levels?

What's an example of the code generated?


asked Jun 19 '14 07:06

jww

1 Answer

I just tested a simple example on gcc.

For x86 this seems to be handled by the optimizer and to depend on the optimization level, although I guess a correct answer here would be "it depends on the compiler".

The generated code is CPU dependent. Some CPUs (sparc64 comes immediately to mind, but I'm sure there are others) have flags on conditional branch instructions that tell the CPU how to predict the branch, so the compiler emits "predict true" or "predict false" instructions based on its built-in rules and on hints in the code (like __builtin_expect).

Intel documents its behavior here: https://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts . In short, when an Intel CPU has no previous information about a branch, it predicts forward branches as not taken and backward branches as taken (think of loops versus error handling).
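To tie this back to the question: the kernel-style likely and unlikely macros are thin wrappers around the builtin. The following is a minimal sketch, assuming GCC or Clang; the macro bodies mirror the commonly seen kernel definitions, while sum_or_error and self_check are hypothetical names invented for illustration. It marks a rare error check (a forward branch) as unlikely and leaves the loop (a backward branch) to the default prediction.

```c
#include <assert.h>
#include <stddef.h>

/* Kernel-style wrappers; !! normalizes any nonzero value to exactly 1. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical helper: sums an array, treating a NULL pointer as a rare error. */
long sum_or_error(const int *values, size_t count)
{
    long total = 0;

    /* Rare error path: hinted as unlikely so it can be moved out of line. */
    if (unlikely(values == NULL))
        return -1;

    /* Loop: the backward branch at the bottom is normally taken. */
    for (size_t i = 0; i < count; i++)
        total += values[i];

    return total;
}

/* Hypothetical self-check: the hints never change the computed results. */
void self_check(void)
{
    int data[] = {1, 2, 3, 4};

    assert(sum_or_error(data, 4) == 10);
    assert(sum_or_error(NULL, 0) == -1);
}
```

The hints only suggest how the compiler should lay out the code; whether the error path actually ends up out of line depends on the compiler and the optimization level.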

This is some example code:

int bar(int);
int
foo(int x)
{
    if (__builtin_expect(x>10, PREDICTION))
        return bar(10);
    return 42;
}

Compiled with the following commands (I'm using -fomit-frame-pointer to make the output more readable, but I still cleaned it up below):

$ cc -S -fomit-frame-pointer -O0 -DPREDICTION=0 -o 00.s foo.c
$ cc -S -fomit-frame-pointer -O0 -DPREDICTION=1 -o 01.s foo.c
$ cc -S -fomit-frame-pointer -O2 -DPREDICTION=0 -o 20.s foo.c
$ cc -S -fomit-frame-pointer -O2 -DPREDICTION=1 -o 21.s foo.c

There's no difference between 00.s and 01.s, so that shows that this is dependent on optimization (for gcc at least).
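Whatever the optimization level, the builtin evaluates to its first argument, so the hint can change code layout but not results. A small sketch of that property, assuming GCC or Clang; the function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Returns its argument unchanged; the second argument of the builtin is only a hint. */
long passthrough_expect_one(long value)
{
    return __builtin_expect(value, 1);
}

/* Same computation with the opposite hint. */
long passthrough_expect_zero(long value)
{
    return __builtin_expect(value, 0);
}

/* Hypothetical check: both variants return the input for a range of values. */
void check_hint_is_value_neutral(void)
{
    for (long v = -3; v <= 12; v++) {
        assert(passthrough_expect_one(v) == v);
        assert(passthrough_expect_zero(v) == v);
    }
}
```

Because the value passes through untouched, a wrong hint costs at most performance, never correctness.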

Here's the (cleaned up) generated code for 20.s:

foo:
    cmpl    $10, %edi
    jg  .L2
    movl    $42, %eax
    ret
.L2:
    movl    $10, %edi
    jmp bar

And here is 21.s:

foo:
    cmpl    $10, %edi
    jle .L6
    movl    $10, %edi
    jmp bar
.L6:
    movl    $42, %eax
    ret

As expected, the compiler rearranged the code so that the path we expect to execute is the fall-through, and the path we don't expect is reached through a taken forward branch.
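To compare both layouts from a single file, the two predictions can be written as separate functions. This is a sketch under assumptions: the bar body is a stand-in (the original only declares it), the noinline attribute is there to keep the call visible, and foo_expect_false, foo_expect_true and check_foo_variants are hypothetical names.

```c
#include <assert.h>

/* Stand-in for the external bar() above; noinline keeps the call in the output. */
__attribute__((noinline)) int bar(int x)
{
    return x * 2;
}

/* Same logic as foo() with PREDICTION=0: the call to bar is hinted as the rare path. */
int foo_expect_false(int x)
{
    if (__builtin_expect(x > 10, 0))
        return bar(10);
    return 42;
}

/* Same logic as foo() with PREDICTION=1: the call to bar is hinted as the common path. */
int foo_expect_true(int x)
{
    if (__builtin_expect(x > 10, 1))
        return bar(10);
    return 42;
}

/* Hypothetical check: both variants compute identical results. */
void check_foo_variants(void)
{
    assert(foo_expect_false(11) == foo_expect_true(11));
    assert(foo_expect_false(5) == 42 && foo_expect_true(5) == 42);
}
```

Compiling this with cc -S -O2 and reading the two functions side by side should show the same kind of reordering as 20.s and 21.s, though the exact instructions and labels will vary with the compiler version and target.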

answered Sep 22 '22 21:09

Art
