A colleague showed me code that I thought wouldn't be necessary, but sure enough, it was. I would expect most compilers to see all of these attempts at equality tests as equivalent:
#include <cstdint>
#include <cstring>

struct Point {
    std::int32_t x, y;
};

[[nodiscard]]
bool naiveEqual(const Point &a, const Point &b) {
    return a.x == b.x && a.y == b.y;
}

[[nodiscard]]
bool optimizedEqual(const Point &a, const Point &b) {
    // Why can't the compiler produce the same assembly in naiveEqual as it does here?
    std::uint64_t ai, bi;
    static_assert(sizeof(Point) == sizeof(ai));
    std::memcpy(&ai, &a, sizeof(Point));
    std::memcpy(&bi, &b, sizeof(Point));
    return ai == bi;
}

[[nodiscard]]
bool optimizedEqual2(const Point &a, const Point &b) {
    return std::memcmp(&a, &b, sizeof(a)) == 0;
}

[[nodiscard]]
bool naiveEqual1(const Point &a, const Point &b) {
    // Let's try avoiding any jumps by using bitwise AND:
    return (a.x == b.x) & (a.y == b.y);
}
But to my surprise, only the ones with memcpy or memcmp get turned into a single 64-bit compare by GCC. Why? (https://godbolt.org/z/aP1ocs)
Isn't it obvious to the optimizer that checking equality on two contiguous four-byte pairs is the same as comparing all eight bytes at once?
An attempt to avoid separately booleanizing the two parts compiles somewhat more efficiently (one fewer instruction and no false dependency on EDX), but still does two separate 32-bit operations:
bool bithackEqual(const Point &a, const Point &b) {
    // a^b == 0 only if they're equal
    return ((a.x ^ b.x) | (a.y ^ b.y)) == 0;
}
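As a quick sanity check (this test driver is not in the original post; it simply reuses the functions above), all five predicates should agree on any pair of Points:

#include <cassert>

// Assumes the Point struct and the five equality functions defined above.
int main() {
    Point p{1, -2}, q{1, -2}, r{3, 4};
    assert( naiveEqual(p, q) &&  naiveEqual1(p, q) &&  bithackEqual(p, q)
         &&  optimizedEqual(p, q) &&  optimizedEqual2(p, q));
    assert(!naiveEqual(p, r) && !naiveEqual1(p, r) && !bithackEqual(p, r)
         && !optimizedEqual(p, r) && !optimizedEqual2(p, r));
}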
GCC and Clang both have the same missed optimizations when passing the structs by value (so a is in RDI and b is in RSI, because that's how the x86-64 System V calling convention packs structs into registers): https://godbolt.org/z/v88a6s. The memcpy / memcmp versions both compile to cmp rdi, rsi / sete al, but the others do separate 32-bit operations.
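For concreteness, the by-value variant in question looks like this (the name is hypothetical; the body is the same as naiveEqual):

// Passed by value: under the x86-64 System V ABI, a arrives in RDI and b in RSI.
bool naiveEqualByValue(Point a, Point b) {
    return a.x == b.x && a.y == b.y;
}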
struct alignas(uint64_t) Point surprisingly still helps in the by-value case where arguments are in registers, optimizing both naiveEqual versions for GCC, but not the bithack XOR/OR (https://godbolt.org/z/ofGa1f). Does this give us any hints about GCC's internals? Clang isn't helped by alignment.
If you "fix" the alignment, all give the same assembly language output (with GCC):
struct alignas(std::int64_t) Point {
    std::int32_t x, y;
};
Demo
As a note, one of the correct/legal ways to do this kind of thing (type punning) is to use memcpy, so having specific optimizations (or being more aggressive) around that function seems logical.
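For instance, in C++20 the same memcpy-style punning can be written with std::bit_cast, which is specified as a copy of the object representation. This bitcastEqual is a sketch, not from the original post:

#include <bit>      // std::bit_cast, C++20
#include <cstdint>

[[nodiscard]]
bool bitcastEqual(const Point &a, const Point &b) {
    // Same trick as optimizedEqual: compare the whole object representations.
    // Requires Point to be trivially copyable and the same size as uint64_t.
    return std::bit_cast<std::uint64_t>(a) == std::bit_cast<std::uint64_t>(b);
}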
There's a performance cliff you risk falling off of when implementing this as a single 64-bit comparison: you break store-to-load forwarding.
If the 32-bit numbers in the structs are written to memory by separate store instructions, and then loaded back with a 64-bit load shortly afterwards (before the stores hit L1$), your execution will stall until the stores commit to globally visible, cache-coherent L1$. If the loads are 32-bit loads that match the previous 32-bit stores, modern CPUs will avoid the store-load stall by forwarding the stored value to the load instruction before the store reaches cache. This violates sequential consistency if multiple CPUs access the memory (a CPU sees its own stores in a different order than other CPUs do), but is allowed by most modern CPU architectures, even x86. Forwarding also allows much more code to be executed completely speculatively: if the execution has to be rolled back, no other CPU can have seen the store, so the code on this CPU that used the loaded value was safe to run speculatively.
If you want this to use 64-bit operations and you don't want this perf cliff, you may want to ensure the struct is also always written as a single 64-bit number.
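A minimal sketch of that idea (the helper name is hypothetical, and a little-endian layout with x in the low half is assumed here):

#include <cstdint>
#include <cstring>

void storePoint(Point &dst, std::int32_t x, std::int32_t y) {
    // Pack both fields into one 64-bit value so the struct is written by a
    // single 64-bit store that a later 64-bit load can forward from.
    std::uint64_t bits = static_cast<std::uint32_t>(x)
        | (static_cast<std::uint64_t>(static_cast<std::uint32_t>(y)) << 32);
    static_assert(sizeof(Point) == sizeof(bits));
    std::memcpy(&dst, &bits, sizeof(bits));
}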
Why can't the compiler generate [same assembly as memcpy version]?
The compiler "could" in the sense that it would be allowed to.
The compiler simply doesn't. Why it doesn't is beyond my knowledge as that requires deep knowledge of how the optimiser has been implemented. But, the answer may range from "there is no logic covering such transformation" to "the rules aren't tuned to assume one output is faster than the other" on all target CPUs.
If you use Clang instead of GCC, you'll notice that it produces the same output for naiveEqual and naiveEqual1, and that the assembly has no jump. It is the same as for the "optimised" version except for using two 32-bit instructions in place of one 64-bit instruction. Furthermore, restricting the alignment of Point as shown in Jarod42's answer has no effect on the optimiser.
MSVC behaves like Clang in that it is unaffected by the alignment, but differs in that it doesn't get rid of the jump in naiveEqual.
For what it's worth, the compilers (I checked GCC and Clang) produce essentially the same output for the C++20 defaulted comparison as they do for naiveEqual. For whatever reason, GCC opted to use jne instead of je for the jump.
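For reference, the C++20 defaulted comparison mentioned here is just:

struct Point {
    std::int32_t x, y;
    // C++20: member-wise equality, semantically the same as naiveEqual.
    bool operator==(const Point &) const = default;
};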
is this a missing compiler optimization
With the assumption that one is always faster than the other on the target CPUs, that would be a fair conclusion.