Test code:
#include <vector>
#include <cstdint>
#include <algorithm>
uint32_t find_min(const std::vector<uint32_t>& v)
{
    return std::distance(v.begin(), std::min_element(v.begin(), v.end()));
}
Find the index of the min element in a vector.
GCC 8 generated code:
find_min(std::vector<unsigned int, std::allocator<unsigned int> > const&):
mov rcx, QWORD PTR [rdi+8]
mov rsi, QWORD PTR [rdi]
xor eax, eax
cmp rcx, rsi
je .L1
lea rax, [rsi+4]
mov rdx, rsi
cmp rcx, rax
je .L4
.L6:
mov edi, DWORD PTR [rdx]
cmp DWORD PTR [rax], edi
cmovb rdx, rax
add rax, 4
cmp rcx, rax
jne .L6
.L4:
mov rax, rdx
sub rax, rsi
sar rax, 2
.L1:
ret
GCC 12 generated code:
find_min(std::vector<unsigned int, std::allocator<unsigned int> > const&):
mov r8, QWORD PTR [rdi+8]
mov rdi, QWORD PTR [rdi]
xor eax, eax
cmp rdi, r8
je .L1
lea rdx, [rdi+4]
cmp r8, rdx
je .L1
mov esi, DWORD PTR [rdi]
mov rax, rdi
.L4:
mov ecx, DWORD PTR [rdx]
cmp ecx, esi
jnb .L3
mov esi, ecx
mov rax, rdx
.L3:
add rdx, 4
cmp r8, rdx
jne .L4
sub rax, rdi
sar rax, 2
.L1:
ret
Both compilers use the -O3 flag; Compiler Explorer was used to generate this code: https://godbolt.org/z/ehv9Mv584
We can see that GCC 8's output uses cmovb, while GCC 12's output uses a branch. In my real test, the GCC 8 code runs faster.
My question is: is there any way to make GCC 12 generate assembly similar to GCC 8's, using cmovb rather than a branch?
I found a solution: disabling the -ftree-partial-pre pass solves this problem:
uint32_t __attribute__((optimize("no-tree-partial-pre"))) find_min(const std::vector<uint32_t>& v) {....}
Can anybody help explain how partial redundancy elimination (PRE) works in this case?
This seems to be a tuning choice / cost-model heuristic (see footnote 1). As @HolyBlackCat noticed, -march=znver3 (which implies the same -mtune) uses 2x cmovb to update the value and the pointer (Godbolt). -march=icelake-server or even -march=sapphirerapids still makes branchy asm.
That branchless asm is probably better on Broadwell and later as well, if the branch doesn't predict well, so this might be worth reporting on https://gcc.gnu.org/bugzilla/ with a "missed-optimization" keyword, especially if you test it for non-tiny array sizes and find it is faster.
Hopefully profile-guided optimization (PGO: -fprofile-generate / test run / -fprofile-use) will make a good choice of branchy vs. branchless based on the actual predictability of your data. Perhaps GCC is optimizing for large arrays on the assumption that a new min will be rare (which would be true for uniformly distributed random values). But it laid out the branch so it's taken in that case, rather than putting the block of mov instructions out-of-line. Related: gcc optimization flag -O3 makes code slower than -O2, re: branchy vs. branchless, where branchless was slower because the condition was predictable; PGO helped there, if I recall correctly. (And where GCC's branchless asm had longer dependency chains than necessary.)
GCC8's branchless code-gen would be bad for non-tiny arrays, so it's a good thing modern GCC doesn't ever make that. Notice that it only uses one cmov to update a pointer, and loads from that conditionally-updated pointer in the cmp. So the load address next iteration depends on the compare + cmov this iteration. As uiCA shows us, this data dependency creates a latency bottleneck of 7 cycles per iteration.
In C terms, it's like min_ptr = (*ptr < *min_ptr) ? ptr : min_ptr;
-fno-tree-partial-pre gives you that one-cmov GCC8 code-gen. (Which is fine for your case where out-of-order exec can hide the latency so the execution overlaps with surrounding code.) It's a disaster for larger arrays.
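Written out in full, the one-cmov loop shape corresponds to something like the following (a sketch; the function name and raw-pointer signature are mine, not GCC's output):

```cpp
#include <cstdint>
#include <cstddef>

// One-cmov shape (GCC 8 / -fno-tree-partial-pre): next iteration's
// *min_ptr load depends on this iteration's compare + cmov, so the
// load -> cmp -> cmov chain is loop-carried (the ~7 cycle/iter
// latency bottleneck described above).
uint32_t find_min_1cmov(const uint32_t* first, const uint32_t* last)
{
    if (first == last) return 0;
    const uint32_t* min_ptr = first;
    for (const uint32_t* p = first + 1; p != last; ++p)
        min_ptr = (*p < *min_ptr) ? p : min_ptr;  // compiles to one cmovb
    return static_cast<uint32_t>(min_ptr - first);
}
```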
GCC 12's branchless asm (with -march=znver3) with two cmovb instructions only uses the updated min value in the next iteration, not the pointer. So the pointer-update cmov is part of a loop-carried dependency chain that's only 1 cycle long. But the value-update cmovb forms a chain with cmp esi, ecx, for a total latency bottleneck of 2 cycles per iteration. (uiCA has a dependency-graph visualizer that was helpful in noticing why it wasn't 1 cycle for that version.)
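In the same C terms, the two-cmov version keeps the running minimum value in a register so that no load depends on a cmov (again a sketch with my own names, under the assumptions above):

```cpp
#include <cstdint>

// Two-cmov shape (-march=znver3): min_val feeds next iteration's
// compare (a 2-cycle cmp + cmov loop-carried chain); min_ptr only
// feeds the final subtraction, so loads never depend on either cmov.
uint32_t find_min_2cmov(const uint32_t* first, const uint32_t* last)
{
    if (first == last) return 0;
    const uint32_t* min_ptr = first;
    uint32_t min_val = *first;
    for (const uint32_t* p = first + 1; p != last; ++p) {
        uint32_t v = *p;                        // load is independent
        min_ptr = (v < min_val) ? p : min_ptr;  // cmovb on the pointer
        min_val = (v < min_val) ? v : min_val;  // cmovb on the value
    }
    return static_cast<uint32_t>(min_ptr - first);
}
```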
uiCA predicts the branchy version can run at 1.33 cycles per iteration on Rocket Lake, assuming perfect branch prediction. (Rocket Lake is like Ice Lake but with mov-elimination not disabled.) It predicts 2.0 cycles/iter on Skylake and Ice Lake, but doesn't explain why. Its RKL prediction might only make sense when there's a new min every iteration: its counts assume the mov instructions actually ran. And on Skylake, it says the first cmp/jcc can only run on port 6, not port 0, which would only be true if it were taken. uiCA (and other tools like IACA and, I think, LLVM-MCA) only really work well for branchless loops, so I'm not sure this is meaningful.
Footnote 1: GCC has a --param option to tune the knobs on internal heuristics. https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html I wouldn't recommend manually playing with these except as an experiment, but that's probably one of the answers to your question about whether there's any GCC option to change the code-gen.
--param max-rtl-if-conversion-predictable-cost=something, max-rtl-if-conversion-insns=something or max-rtl-if-conversion-unpredictable-cost=something are probably relevant since if-conversion (to branchless) is the optimization in question.
The -mtune=znver3 implied by -march=znver3 presumably sets different values for those or other options than -mtune=skylake or the default generic.