Test code:
#include <vector>
#include <cstdint>
#include <algorithm>
uint32_t find_min(const std::vector<uint32_t>& v)
{
    return std::distance(v.begin(), std::min_element(v.begin(), v.end()));
}
Find the index of the min element in a vector.
GCC 8 generated code:
find_min(std::vector<unsigned int, std::allocator<unsigned int> > const&):
mov rcx, QWORD PTR [rdi+8]
mov rsi, QWORD PTR [rdi]
xor eax, eax
cmp rcx, rsi
je .L1
lea rax, [rsi+4]
mov rdx, rsi
cmp rcx, rax
je .L4
.L6:
mov edi, DWORD PTR [rdx]
cmp DWORD PTR [rax], edi
cmovb rdx, rax
add rax, 4
cmp rcx, rax
jne .L6
.L4:
mov rax, rdx
sub rax, rsi
sar rax, 2
.L1:
ret
GCC 12 generated code:
find_min(std::vector<unsigned int, std::allocator<unsigned int> > const&):
mov r8, QWORD PTR [rdi+8]
mov rdi, QWORD PTR [rdi]
xor eax, eax
cmp rdi, r8
je .L1
lea rdx, [rdi+4]
cmp r8, rdx
je .L1
mov esi, DWORD PTR [rdi]
mov rax, rdi
.L4:
mov ecx, DWORD PTR [rdx]
cmp ecx, esi
jnb .L3
mov esi, ecx
mov rax, rdx
.L3:
add rdx, 4
cmp r8, rdx
jne .L4
sub rax, rdi
sar rax, 2
.L1:
ret
Both compilers use the -O3 flag; Compiler Explorer was used to generate this code: https://godbolt.org/z/ehv9Mv584
We can see that GCC 8's output uses cmovb, while GCC 12's output uses a branch. In my real test, the GCC 8 code runs faster.
My question is: is there any way to make GCC 12 generate assembly similar to GCC 8's, using cmovb rather than a branch?
I found a solution: disabling the -ftree-partial-pre pass solves this problem:
uint32_t __attribute__((optimize("no-tree-partial-pre"))) find_min(const std::vector<uint32_t>& v) {....}
Can anybody help explain how partial redundancy elimination (PRE) works in this case?
This seems to be a tuning choice / cost-model heuristic (see footnote 1). As @HolyBlackCat noticed, -march=znver3 (which implies the same -mtune) uses 2x cmovb to update the value and the pointer (Godbolt). -march=icelake-server or even -march=sapphirerapids still makes branchy asm.
That branchless asm is probably better on Broadwell and later as well, if the branch doesn't predict well, so this might be worth reporting on https://gcc.gnu.org/bugzilla/ with a "missed-optimization" keyword, especially if you test it for non-tiny array sizes and find it is faster.
Hopefully profile-guided optimization (PGO: -fprofile-generate / test run / -fprofile-use) will make a good choice of branchy vs. branchless based on the actual predictability of your data. Perhaps GCC is optimizing for large arrays on the assumption that a new min will be rare (which would be true for uniformly distributed random values). But it laid out the branch so it's taken in that case, rather than putting the block of mov instructions out-of-line. Related: gcc optimization flag -O3 makes code slower than -O2, re: branchy vs. branchless, where branchless was slower because the condition was predictable; PGO helped there, if I recall correctly. (And where GCC's branchless asm had longer dependency chains than necessary.)
GCC8's branchless code-gen would be bad for non-tiny arrays, so it's a good thing modern GCC doesn't ever make that. Notice that it only uses one cmov to update a pointer, and loads from that conditionally-updated pointer in the cmp. So the load address next iteration depends on the compare + cmov this iteration. As uiCA shows us, this data dependency creates a latency bottleneck of 7 cycles per iteration.
In C terms, it's like min_ptr = (*ptr < *min_ptr) ? ptr : min_ptr;
-fno-tree-partial-pre gives you that one-cmov GCC8 code-gen. (Which is fine for your case where out-of-order exec can hide the latency so the execution overlaps with surrounding code.) It's a disaster for larger arrays.
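Written out in full, the one-cmov loop shape corresponds to something like the following (a sketch; the function name and raw-pointer signature are mine, not GCC's output):

```cpp
#include <cstdint>
#include <cstddef>

// One-cmov shape (GCC 8 / -fno-tree-partial-pre): next iteration's
// *min_ptr load depends on this iteration's compare + cmov, so the
// load -> cmp -> cmov chain is loop-carried (the ~7 cycle/iter
// latency bottleneck described above).
uint32_t find_min_1cmov(const uint32_t* first, const uint32_t* last)
{
    if (first == last) return 0;
    const uint32_t* min_ptr = first;
    for (const uint32_t* p = first + 1; p != last; ++p)
        min_ptr = (*p < *min_ptr) ? p : min_ptr;  // compiles to one cmovb
    return static_cast<uint32_t>(min_ptr - first);
}
```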
GCC 12's branchless asm (with -march=znver3) with two cmovb instructions only uses the updated min value in the next iteration, not the pointer. So the pointer-update cmov is part of a loop-carried dependency chain that's only 1 cycle long. But the value-update cmovb forms a chain with cmp esi, ecx, for a total latency bottleneck of 2 cycles per iteration. (uiCA has a dependency-graph visualizer that was helpful in noticing why it wasn't 1 cycle for that version.)
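In the same C terms, the two-cmov version keeps the running minimum value in a register so that no load depends on a cmov (again a sketch with my own names, under the assumptions above):

```cpp
#include <cstdint>

// Two-cmov shape (-march=znver3): min_val feeds next iteration's
// compare (a 2-cycle cmp + cmov loop-carried chain); min_ptr only
// feeds the final subtraction, so loads never depend on either cmov.
uint32_t find_min_2cmov(const uint32_t* first, const uint32_t* last)
{
    if (first == last) return 0;
    const uint32_t* min_ptr = first;
    uint32_t min_val = *first;
    for (const uint32_t* p = first + 1; p != last; ++p) {
        uint32_t v = *p;                        // load is independent
        min_ptr = (v < min_val) ? p : min_ptr;  // cmovb on the pointer
        min_val = (v < min_val) ? v : min_val;  // cmovb on the value
    }
    return static_cast<uint32_t>(min_ptr - first);
}
```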
uiCA predicts the branchy version can run at 1.33 cycles per iteration on Rocket Lake, assuming perfect branch prediction. (Rocket Lake is like Ice Lake but with mov-elimination not disabled.) It predicts 2.0 cycles/iter on Skylake and Ice Lake, but doesn't explain why. Its RKL prediction might only make sense when there's a new min every iteration: its counts assume the mov instructions actually ran. And on Skylake, it says the first cmp/jcc can only run on port 6, not port 0, which would only be true if it were taken. uiCA (and other tools like IACA and, I think, LLVM-MCA) only really work well for branchless loops, so I'm not sure this is meaningful.
Footnote 1: GCC has a --param option to tune the knobs on internal heuristics. https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html I wouldn't recommend manually playing with these except as an experiment, but that's probably one of the answers to your question about whether there's any GCC option to change the code-gen.
--param max-rtl-if-conversion-predictable-cost=something, max-rtl-if-conversion-insns=something or max-rtl-if-conversion-unpredictable-cost=something are probably relevant since if-conversion (to branchless) is the optimization in question.
The -mtune=znver3 implied by -march=znver3 presumably sets different values for those or other options than -mtune=skylake or the default generic.