while playing around with godbolt.org I noticed that gcc (6.2, 7.0 snapshot), clang (3.9) and icc (17) when compiling something close to
int a(int* a, int* b) {
if (b - a < 2) return *a = ~*a;
// register intensive code here e.g. sorting network
}
compiles (-O2/-O3) this into somthing like this:
push r15
mov rax, rcx
push r14
sub rax, rdx
push r13
push r12
push rbp
push rbx
sub rsp, 184
mov QWORD PTR [rsp], rdx
cmp rax, 7
jg .L95
not DWORD PTR [rdx]
.L162:
add rsp, 184
pop rbx
pop rbp
pop r12
pop r13
pop r14
pop r15
ret
which obviously has a huge overhead in case of b - a < 2. In case of -Os gcc compiles to:
mov rax, rcx
sub rax, rdx
cmp rax, 7
jg .L74
not DWORD PTR [rdx]
ret
.L74:
Which leads me to beleave that there is no code keeping the compiler from emitting this shorter code.
Is there a reason why compilers do this ? Is there a way to get them compiling to the shorter version without compiling for size?
Here's an example on Godbolt that reproduces this. It seems to have something to do with the complex part being recursive
This is a known compiler limitation, see my comments on the question. IDK why it exists; maybe it's hard for compilers to decide what they can do without spilling when they haven't finished saving regs yet.
Pulling the early-out check into a wrapper is often useful when it's small enough to inline.
Looks like modern gcc can actually sidestep this compiler limitation sometimes.
Using your example on the Godbolt compiler explorer, adding a second caller is enough to get even gcc6.1 -O2 to split the function for you, so it can inline the early-out into the second caller and into the externally visible square()
(which ends with jmp square(int*, int*) [clone .part.3]
if the early-out return path isn't taken).
code on Godbolt, note I added -std=gnu++14
, which is required for clang to compiler your code.
void square_inlinewrapper(int* a, int* b) {
//if (b - a < 16) return; // gcc inlines this part for us, and calls a private clone of the function!
return square(a, b);
}
# gcc6.1 -O2 (default / generic -march= and -mtune=)
mov rax, rsi
sub rax, rdi
cmp rax, 63
jg .L9
rep ret
.L9:
jmp square(int*, int*) [clone .part.3]
square()
itself compiles to the same thing, calling the private clone which has the bulk of the code. The recursive calls from inside the clone call the wrapper function, so they don't do the extra push/pop work when it's not needed.
Even gcc7 doesn't do this when there's no other caller, even at -O3. It does still transform one of the recursive calls into a loop, but the other one just calls the big function again.
Clang 3.9 and icc17 don't clone the function, either, so you should write the inlineable wrapper manually (and change the main body of the function to use it for recursive calls, if the check is needed there).
You might want to name the wrapper square
, and rename just the main body to a private name (like static void square_impl
).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With