Consider the following snippet of code:
int* find_ptr(int* mem, int sz, int val) {
    for (int i = 0; i < sz; i++) {
        if (mem[i] == val) {
            return &mem[i];
        }
    }
    return nullptr;
}
GCC on -O3 compiles this to:
find_ptr(int*, int, int):
        mov     rax, rdi
        test    esi, esi
        jle     .L4                  # why not .L8?
        lea     ecx, [rsi-1]
        lea     rcx, [rdi+4+rcx*4]
        jmp     .L3
.L9:
        add     rax, 4
        cmp     rax, rcx
        je      .L8
.L3:
        cmp     DWORD PTR [rax], edx
        jne     .L9
        ret
.L8:
        xor     eax, eax
        ret
.L4:
        xor     eax, eax
        ret
In this assembly, the blocks with labels .L4 and .L8 are identical. Would it not be better to rewrite jumps to .L4 so they target .L8, and drop .L4? I thought this might be a bug, but clang also duplicates the xor-ret sequence back to back. However, ICC and MSVC each take a pretty different approach.
Is this an optimization in this case and, if not, are there times when it would be? What is the rationale behind this behavior?
This is always a missed optimization. Having both return-0 paths use the same basic block would be a pure win on all microarchitectures that current compilers care about.
But unfortunately this missed optimization is not rare with gcc. Often it's a separate bare ret that gcc conditionally branches to, instead of branching to a ret in another existing path. (x86 doesn't have a conditional ret, so simple functions that don't need any stack cleanup often just need to branch to a ret. Often functions this small would get inlined into a complete program, so maybe it doesn't hurt a lot in real life?)
CPUs (since Pentium Pro if not earlier) have a return-address predictor stack that easily predicts the branch target for ret instructions, so there's not going to be an effect from one ret instruction more often returning to one caller and another ret more often returning to another caller. It doesn't help branch prediction to separate them and let them use different entries.
IDK about Pentium 4 and whether the traces in its trace cache follow call/ret. But fortunately that's not relevant anymore. The decoded-uop cache in SnB-family and Ryzen is not a trace cache; a line/way of uop cache holds uops for a contiguous block of x86 machine code, and unconditional jumps end a uop cache line. (https://agner.org/optimize/) So if anything, this could be worse for SnB-family because each return path needs a separate line of the uop cache even though they're each only 2 uops total (xor-zero and ret are both single-uop instructions).
Report this MCVE to gcc's bugzilla with keyword missed-optimization: https://gcc.gnu.org/bugzilla/enter_bug.cgi?product=gcc
(update: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90178 was reported by the OP. A fix was attempted, but reverted; for now it's still open. In this case it seems to be caused by -mavx, perhaps some interaction with return paths that need vzeroupper or not.)
Cause:
You can kind of see how it might arrive at 2 exit blocks: compilers normally transform for loops into if(sz>0) { do{}while(); } if there's a possibility of the loop needing to run 0 times, like gcc did here. So there's one branch that leaves the function without entering the loop at all, while the other exit is the fall-through from the loop. Perhaps before optimizing away some stuff, there was some extra cleanup on one of those paths. Or those paths simply got split up when the first branch was created.
I don't know why gcc fails to notice and merge two identical basic blocks that end with ret. Maybe it only looks for that in some GIMPLE or RTL pass where the blocks weren't actually identical yet, and they only became identical during final x86 code-gen. Maybe after optimizing away the save/restore of a register that held some temporary it ended up not needing?
You could dig deeper by looking at GCC's GIMPLE or RTL with the -fdump-tree-... options after certain optimization passes: Godbolt has a UI for that, in the + dropdown -> tree / RTL output. https://godbolt.org/z/l9mVlE. But unless you're a gcc-internals expert planning to work on a patch or an idea to help gcc find this optimization, it's probably not worth your time.
Interesting discovery that it only happens with -mavx (enabled by -march=skylake or directly). GCC and clang don't know how to auto-vectorize loops where the trip count is not known before the first iteration, e.g. search loops like this, or memchr or strlen. So IDK why AVX even makes a difference at all.
(Note that the C abstract machine never reads mem[i] beyond the search point, and those elements might not actually exist. e.g. there's no UB if you passed this function a pointer to the last int before an unmapped page, with sz=1000, as long as *mem == val. So to auto-vectorize without the guaranteed object size of int mem[static sz], the compiler would have to align the pointer... Not that C11 int mem[static sz] would even help; even a static array of compile-time-constant size larger than the max possible trip count wouldn't get gcc to auto-vectorize.)