Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Unneccessary pop instructions in functions with early if statement

while playing around with godbolt.org I noticed that gcc (6.2, 7.0 snapshot), clang (3.9) and icc (17) when compiling something close to

int a(int* a, int* b) {
  if (b - a < 2) return *a = ~*a;

  // register intensive code here e.g. sorting network
}

compiles (-O2/-O3) this into somthing like this:

    push    r15
    mov     rax, rcx
    push    r14
    sub     rax, rdx
    push    r13
    push    r12
    push    rbp
    push    rbx
    sub     rsp, 184
    mov     QWORD PTR [rsp], rdx
    cmp     rax, 7
    jg      .L95
    not     DWORD PTR [rdx]
 .L162:
    add     rsp, 184
    pop     rbx
    pop     rbp
    pop     r12
    pop     r13
    pop     r14
    pop     r15
    ret

which obviously has a huge overhead in case of b - a < 2. In case of -Os gcc compiles to:

    mov     rax, rcx
    sub     rax, rdx
    cmp     rax, 7
    jg      .L74
    not     DWORD PTR [rdx]
    ret
.L74:

Which leads me to beleave that there is no code keeping the compiler from emitting this shorter code.

Is there a reason why compilers do this ? Is there a way to get them compiling to the shorter version without compiling for size?


Here's an example on Godbolt that reproduces this. It seems to have something to do with the complex part being recursive

like image 537
Christoph Diegelmann Avatar asked Oct 20 '16 08:10

Christoph Diegelmann


1 Answers

This is a known compiler limitation, see my comments on the question. IDK why it exists; maybe it's hard for compilers to decide what they can do without spilling when they haven't finished saving regs yet.

Pulling the early-out check into a wrapper is often useful when it's small enough to inline.


Looks like modern gcc can actually sidestep this compiler limitation sometimes.

Using your example on the Godbolt compiler explorer, adding a second caller is enough to get even gcc6.1 -O2 to split the function for you, so it can inline the early-out into the second caller and into the externally visible square() (which ends with jmp square(int*, int*) [clone .part.3] if the early-out return path isn't taken).

code on Godbolt, note I added -std=gnu++14, which is required for clang to compiler your code.

void square_inlinewrapper(int* a, int* b) {
  //if (b - a < 16) return;  // gcc inlines this part for us, and calls a private clone of the function!

  return square(a, b);
}

# gcc6.1 -O2  (default / generic -march= and -mtune=)
    mov     rax, rsi
    sub     rax, rdi
    cmp     rax, 63
    jg      .L9
    rep ret
.L9:
    jmp     square(int*, int*) [clone .part.3]

square() itself compiles to the same thing, calling the private clone which has the bulk of the code. The recursive calls from inside the clone call the wrapper function, so they don't do the extra push/pop work when it's not needed.


Even gcc7 doesn't do this when there's no other caller, even at -O3. It does still transform one of the recursive calls into a loop, but the other one just calls the big function again.


Clang 3.9 and icc17 don't clone the function, either, so you should write the inlineable wrapper manually (and change the main body of the function to use it for recursive calls, if the check is needed there).

You might want to name the wrapper square, and rename just the main body to a private name (like static void square_impl).

like image 105
Peter Cordes Avatar answered Oct 14 '22 21:10

Peter Cordes