Why doesn't this short comparison optimize the way I expected?

Question

I have a compound index type that consists of two 16-bit integers packed together into a 32-bit object designed to be passed around and treated a bit like a pointer. But I chanced to notice that the comparison operators I defined weren't being optimized the way I expected.

Given this cut-down code:

#include <cstdint>

struct TwoParter {
    std::uint16_t blk;
    std::uint16_t ofs;
};
static_assert (sizeof(TwoParter) == sizeof(std::uint32_t), "pack densely");

bool equal1 (TwoParter const & lhs, TwoParter const & rhs) {
    return lhs.blk == rhs.blk && lhs.ofs == rhs.ofs;
}

bool equal2 (TwoParter const & lhs, TwoParter const & rhs) {
    auto lp = reinterpret_cast <std::uint32_t const *> (&lhs);
    auto rp = reinterpret_cast <std::uint32_t const *> (&rhs);
    return *lp == *rp;
}

GCC (7.1 on Compiler Explorer) produces the following assembly (options -m64 -std=c++11 -O3):

equal1(TwoParter const&, TwoParter const&):
        movzwl  (%rsi), %edx
        xorl    %eax, %eax
        cmpw    %dx, (%rdi)
        je      .L5
        rep ret
.L5:
        movzwl  2(%rsi), %eax
        cmpw    %ax, 2(%rdi)
        sete    %al
        ret
equal2(TwoParter const&, TwoParter const&):
        movl    (%rsi), %eax
        cmpl    %eax, (%rdi)
        sete    %al
        ret

One of these seems to be doing more work than the other. But I just can't see how they're different: the assertion guarantees that the layout of the struct is such that the comparison as a uint23_t must compare all of the same data as examining the uint16_t fields separately. More importantly, this is x86, so the compiler already knew that would be the case. The short-circuiting behaviour of && shouldn't be important to the output, because its right-hand operand has no effects (and the compiler can see this), and since nothing else interesting is happening I can't imagine why it would want to e.g. defer loading the second half of the data.

Replacing the && with the & operator gets rid of the jump but does not fundamentally change what the code does: it still generates two separate 16-bit comparisons, instead of comparing all of the data in one go, which indicates that the short-circuit is probably not the issue (although it does raise a related question of why it doesn't compile && and & the same way either - surely one of the two should be "better" in both cases).

What interests me is that, also according to Compiler Explorer, all the major compilers (GCC, Clang, Intel, MSVC) seem to do roughly the same thing. That reduces the chances that this is an oversight on the part of one optimizer, but what I can't see is how my own assessment of this is actually wrong.

So there are two parts to this:

1) does equal1 really do the same thing as equal2? Am I missing something crazy here?

2) if it does, why would compilers choose not to emit the shorter instruction sequence?

I'm sure that optimizations in this vein must be something compilers know about, because they would be more legitimately useful for speeding up other, more serious code like e.g. memcmp shoving things into vector registers to compare more data at once.

Jarod42 · Accepted Answer

Alignment requirement are not the same, TwoParter has same alignment than std::uint16_t.

Changing TwoParter to

struct alignas(std::uint32_t) TwoParter {
    std::uint16_t blk;
    std::uint16_t ofs;
};

produces same asm for gcc 7.1:

equal1(TwoParter const&, TwoParter const&):
        movl    (%rsi), %eax
        cmpl    %eax, (%rdi)
        sete    %al
        ret
equal2(TwoParter const&, TwoParter const&):
        movl    (%rsi), %eax
        cmpl    %eax, (%rdi)
        sete    %al
        ret

Demo

Why doesn't this short comparison optimize the way I expected?

Tags:

c++

optimization

assembly

Leushenko

1 Answers

Jarod42

Recent Activity

Donate For Us

Why doesn't this short comparison optimize the way I expected?

Tags:

c++

optimization

assembly

Leushenko

1 Answers

Jarod42

Related questions

Recent Activity

Donate For Us