Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What optimizations does __builtin_unreachable facilitate?

Judging from gcc's documentation

If control flow reaches the point of the __builtin_unreachable, the program is undefined.

I thought __builtin_unreachable may be used as a hint to the optimizer in all sorts of creative ways. So I did a little experiment

void stdswap(int& x, int& y)
{
    std::swap(x, y);
}

void brswap(int& x, int& y)
{
    if(&x == &y)
        __builtin_unreachable();
    x ^= y;
    y ^= x;
    x ^= y;
}

void rswap(int& __restrict x, int& __restrict y)
{
    x ^= y;
    y ^= x;
    x ^= y;
}

gets compiled to (g++ -O2)

stdswap(int&, int&):
        mov     eax, DWORD PTR [rdi]
        mov     edx, DWORD PTR [rsi]
        mov     DWORD PTR [rdi], edx
        mov     DWORD PTR [rsi], eax
        ret
brswap(int&, int&):
        mov     eax, DWORD PTR [rdi]
        xor     eax, DWORD PTR [rsi]
        mov     DWORD PTR [rdi], eax
        xor     eax, DWORD PTR [rsi]
        mov     DWORD PTR [rsi], eax
        xor     DWORD PTR [rdi], eax
        ret
rswap(int&, int&):
        mov     eax, DWORD PTR [rsi]
        mov     edx, DWORD PTR [rdi]
        mov     DWORD PTR [rdi], eax
        mov     DWORD PTR [rsi], edx
        ret

I assume that stdswap and rswap is optimal from the optimizer's perspective. Why doesn't brswap get compiled to the same thing? Can I get it to compile to the same thing with __builtin_unreachable?

like image 515
Passer By Avatar asked Feb 19 '19 10:02

Passer By


Video Answer


2 Answers

The purpose of __builtin_unreachable is to help the compiler to:

  • Remove dead code (that programmer knows will never be executed)
  • Linearize the code by letting compiler know that the path is "cold" (similar effect is achieved by calling noreturn function)

Consider the following:

void exit_if_true(bool x);

int foo1(bool x)
{
    if (x) {
        exit_if_true(true);
        //__builtin_unreachable(); // we do not enable it here
    } else {
        std::puts("reachable");
    }

    return 0;
}
int foo2(bool x)
{
    if (x) {
        exit_if_true(true);
        __builtin_unreachable();  // now compiler knows exit_if_true
                                  // will not return as we are passing true to it
    } else {
        std::puts("reachable");
    }

    return 0;
}

Generated code:

foo1(bool):
        sub     rsp, 8
        test    dil, dil
        je      .L2              ; that jump is going to change
        mov     edi, 1
        call    exit_if_true(bool)
        xor     eax, eax         ; that tail is going to be removed
        add     rsp, 8
        ret
.L2:
        mov     edi, OFFSET FLAT:.LC0
        call    puts
        xor     eax, eax
        add     rsp, 8
        ret
foo2(bool):
        sub     rsp, 8
        test    dil, dil
        jne     .L9              ; changed jump
        mov     edi, OFFSET FLAT:.LC0
        call    puts
        xor     eax, eax
        add     rsp, 8
        ret
.L9:
        mov     edi, 1
        call    exit_if_true(bool)

Notice the differences:

  • xor eax, eax and ret were removed as now compiler knows that is a dead code.
  • The compiler swapped the order of branches: branch with puts call now comes first so that conditional jump can be faster (forward branches that are not taken are faster both when predicted and when there is no prediction information).

The assumption here is that branch that ends with noreturn function call or __builtin_unreachable will either be executed only once or leads to longjmp call or exception throw both of which are rare and do not need to be prioritized during optimization.

You are trying to use it for a different purpose - by giving compiler information about aliasing (and you can try doing the same for alignment). Unfortunately GCC doesn't understand such address checks.

As you have noticed, adding __restrict__ helps. So __restrict__ works for aliasing, __builtin_unreachable does not.

Look at the following example that uses __builtin_assume_aligned:

void copy1(int *__restrict__ dst, const int *__restrict__ src)
{
    if (reinterpret_cast<uintptr_t>(dst) % 16 == 0) __builtin_unreachable();
    if (reinterpret_cast<uintptr_t>(src) % 16 == 0) __builtin_unreachable();
        
    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];
}

void copy2(int *__restrict__ dst, const int *__restrict__ src)
{
    dst = static_cast<int *>(__builtin_assume_aligned(dst, 16));
    src = static_cast<const int *>(__builtin_assume_aligned(src, 16));

    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];
}

Generated code:

copy1(int*, int const*):
        movdqu  xmm0, XMMWORD PTR [rsi]
        movups  XMMWORD PTR [rdi], xmm0
        ret
copy2(int*, int const*):
        movdqa  xmm0, XMMWORD PTR [rsi]
        movaps  XMMWORD PTR [rdi], xmm0
        ret

You could assume that compiler can understand that dst % 16 == 0 means the pointer is 16-byte aligned, but it doesn't. So unaligned stores and loads are used, while the second version generates faster instructions that require address to be aligned.

like image 178
StaceyGirl Avatar answered Nov 02 '22 12:11

StaceyGirl


I think you are the trying to micro-optimize your code wrong moving a wrong direction.

__builtin_unreachable as well as __builtin_expect doing what expected - in your case remove unnecessary cmp and jnz from unused if operator.

Compiler should generate the machine code using C code you've write to produce predictable program. And during optimization, it able to find and optimize (i.e. replace with better machine code version) some patterns, when it known to optimization algorithm - such optimization would not broke the program behavior.

E.g. something like

char a[100];
for(int i=0; i < 100; i++)
   a[i]  = 0;

will be replaced single call to library std::memset(a,0,100) which is implemented using assembly, and optimal for the current CPU architecture.

As well as compiler able to detect

x ^= y;
y ^= x;
x ^= y;

and replace it with simplest mashie code.

I think your if operator and unreached directive influenced the compiler optimizer, so that is can not optimize.

In case of swapping of two integers, 3-rd temporary exchange variable can be removed by compiler it self, i.e. it is going to be something like

movl    $2, %ebx
movl    $1, %eax
xchg    %eax,%ebx  

Where ebx and eax register values are actually your x and y. You can implement it your self like

void swap_x86(int& x, int& y)
{
    __asm__ __volatile__( "xchg %%rax, %%rbx": "=a"(x), "=b"(y) : "a"(x), "b"(y) : );
}
...
int a = 1;
int b = 2;
swap_x86(a,b);

When to use __builtin_unreachable? Probably when you known that some situation are practically impossible, but logically it may happens. I.e. you have some function like

void foo(int v) {

    switch( v ) {
        case 0:
            break;
        case 1:
            break;
        case 2:
            break;
        case 3:
            break;
        default:
            __builtin_unreachable();
    }
}

And you know that v argument value is always between 0 and 3. However, int range is -2147483648 to 2147483647 (when int is 32 bit type), compiler have no idea about real values range and not able to remove the default block (as well as some cmp instructions etc), but it will warn you if you don't add this block into switch. So in this case __builtin_unreachable may help.

like image 44
Victor Gubin Avatar answered Nov 02 '22 12:11

Victor Gubin