I have a compound index type that consists of two 16-bit integers packed together into a 32-bit object designed to be passed around and treated a bit like a pointer. But I chanced to notice that the comparison operators I defined weren't being optimized the way I expected.
Given this cut-down code:
#include <cstdint>
struct TwoParter {
std::uint16_t blk;
std::uint16_t ofs;
};
static_assert (sizeof(TwoParter) == sizeof(std::uint32_t), "pack densely");
bool equal1 (TwoParter const & lhs, TwoParter const & rhs) {
return lhs.blk == rhs.blk && lhs.ofs == rhs.ofs;
}
bool equal2 (TwoParter const & lhs, TwoParter const & rhs) {
auto lp = reinterpret_cast <std::uint32_t const *> (&lhs);
auto rp = reinterpret_cast <std::uint32_t const *> (&rhs);
return *lp == *rp;
}
GCC (7.1 on Compiler Explorer) produces the following assembly (options -m64 -std=c++11 -O3
):
equal1(TwoParter const&, TwoParter const&):
movzwl (%rsi), %edx
xorl %eax, %eax
cmpw %dx, (%rdi)
je .L5
rep ret
.L5:
movzwl 2(%rsi), %eax
cmpw %ax, 2(%rdi)
sete %al
ret
equal2(TwoParter const&, TwoParter const&):
movl (%rsi), %eax
cmpl %eax, (%rdi)
sete %al
ret
One of these seems to be doing more work than the other. But I just can't see how they're different: the assertion guarantees that the layout of the struct is such that the comparison as a uint23_t
must compare all of the same data as examining the uint16_t
fields separately. More importantly, this is x86, so the compiler already knew that would be the case. The short-circuiting behaviour of &&
shouldn't be important to the output, because its right-hand operand has no effects (and the compiler can see this), and since nothing else interesting is happening I can't imagine why it would want to e.g. defer loading the second half of the data.
Replacing the &&
with the &
operator gets rid of the jump but does not fundamentally change what the code does: it still generates two separate 16-bit comparisons, instead of comparing all of the data in one go, which indicates that the short-circuit is probably not the issue (although it does raise a related question of why it doesn't compile &&
and &
the same way either - surely one of the two should be "better" in both cases).
What interests me is that, also according to Compiler Explorer, all the major compilers (GCC, Clang, Intel, MSVC) seem to do roughly the same thing. That reduces the chances that this is an oversight on the part of one optimizer, but what I can't see is how my own assessment of this is actually wrong.
So there are two parts to this:
1) does equal1
really do the same thing as equal2
? Am I missing something crazy here?
2) if it does, why would compilers choose not to emit the shorter instruction sequence?
I'm sure that optimizations in this vein must be something compilers know about, because they would be more legitimately useful for speeding up other, more serious code like e.g. memcmp
shoving things into vector registers to compare more data at once.
Alignment requirement are not the same, TwoParter
has same alignment than std::uint16_t
.
Changing TwoParter
to
struct alignas(std::uint32_t) TwoParter {
std::uint16_t blk;
std::uint16_t ofs;
};
produces same asm for gcc 7.1:
equal1(TwoParter const&, TwoParter const&):
movl (%rsi), %eax
cmpl %eax, (%rdi)
sete %al
ret
equal2(TwoParter const&, TwoParter const&):
movl (%rsi), %eax
cmpl %eax, (%rdi)
sete %al
ret
Demo
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With