Given this code:
#include <string.h>
int equal4(const char* a, const char* b)
{
return memcmp(a, b, 4) == 0;
}
int less4(const char* a, const char* b)
{
return memcmp(a, b, 4) < 0;
}
GCC 7 on x86_64 introduced an optimization for the first case (Clang has done it for a long time):
mov eax, DWORD PTR [rsi]
cmp DWORD PTR [rdi], eax
sete al
movzx eax, al
But the second case still calls memcmp()
:
sub rsp, 8
mov edx, 4
call memcmp
add rsp, 8
shr eax, 31
Could a similar optimization be applied to the second case? What's the best assembly for this, and is there any clear reason why it isn't being done (by GCC or Clang)?
See it on Godbolt's Compiler Explorer: https://godbolt.org/g/jv8fcf
memcmp is often implemented in assembly to take advantage of a number of architecture-specific features, which can make it much faster than a simple loop in C.
The memcmp() built-in function compares the first count bytes of buf1 and buf2. The relation is determined by the sign of the difference between the values of the leftmost first pair of bytes that differ.
If you generate code for a little-endian platform, optimizing four-byte memcmp
for inequality to a single DWORD comparison is invalid.
When memcmp
compares individual bytes it goes from low-addressed bytes to high-addressed bytes, regardless of the platform.
In order for memcmp
to return zero all four bytes must be identical. Hence, the order of comparison does not matter. Therefore, DWORD optimization is valid, because you ignore the sign of the result.
However, when memcmp
returns a positive number, byte ordering matters. Hence, implementing the same comparison using 32-bit DWORD comparison requires a specific endianness: the platform must be big-endian, otherwise the result of comparison would be incorrect.
Endianness is the problem here. Consider this input:
a = 01 00 00 03 b = 02 00 00 02
If you compare these two arrays by treating them as 32-bit integers, then you'll find that a
is larger (because 0x03000001 > 0x02000002). On a big-endian machine, this test would probably work as expected.
As discussed in other answers/comments, using memcmp(a,b,4) < 0
is equivalent to an unsigned
comparison between big-endian integers. It couldn't inline as efficiently as == 0
on little-endian x86.
More importantly, the current version of this behaviour in gcc7/8 only looks for memcmp() == 0
or != 0
. Even on a big-endian target where this could inline just as efficiently for <
or >
, gcc won't do it. (Godbolt's newest big-endian compilers are PowerPC 64 gcc6.3, and MIPS/MIPS64 gcc5.4. mips
is big-endian MIPS, while mipsel
is little-endian MIPS.) If testing this with future gcc, use a = __builtin_assume_align(a, 4)
to make sure gcc doesn't have to worry about unaligned-load performance/correctness on non-x86. (Or just use const int32_t*
instead of const char*
.)
If/when gcc learns to inline memcmp
for cases other than EQ/NE, maybe gcc will do it on little-endian x86 when its heuristics tell it the extra code size will be worth it. e.g. in a hot loop when compiling with -fprofile-use
(profile-guided optimization).
If you want compilers to do a good job for this case, you should probably assign to a uint32_t
and use an endian-conversion function like ntohl
. But make sure you pick one that can actually inline; apparently Windows has an ntohl
that compiles to a DLL call. See other answers on that question for some portable-endian stuff, and also someone's imperfect attempt at a portable_endian.h
, and this fork of it. I was working on a version for a while, but never finished/tested it or posted it.
The pointer-casting may be Undefined Behaviour, depending on how you wrote the bytes and what the char*
points to. If you're not sure about strict-aliasing and/or alignment, memcpy
into abytes
. Most compilers are good at optimizing away small fixed-size memcpy
.
// I know the question just wonders why gcc does what it does,
// not asking for how to write it differently.
// Beware of alignment performance or even fault issues outside of x86.
#include <endian.h>
#include <stdint.h>
int equal4_optim(const char* a, const char* b) {
uint32_t abytes = *(const uint32_t*)a;
uint32_t bbytes = *(const uint32_t*)b;
return abytes == bbytes;
}
int less4_optim(const char* a, const char* b) {
uint32_t a_native = be32toh(*(const uint32_t*)a);
uint32_t b_native = be32toh(*(const uint32_t*)b);
return a_native < b_native;
}
I checked on Godbolt, and that compiles to efficient code (basically identical to what I wrote in asm below), especially on big-endian platforms, even with old gcc. It also makes much better code than ICC17, which inlines memcmp
but only to a byte-compare loop (even for the == 0
case.
I think this hand-crafted sequence is an optimal implementation of less4()
(for the x86-64 SystemV calling convention, like used in the question, with const char *a
in rdi
and b
in rsi
).
less4:
mov edi, [rdi]
mov esi, [rsi]
bswap edi
bswap esi
# data loaded and byte-swapped to native unsigned integers
xor eax,eax # solves the same problem as gcc's movzx, see below
cmp edi, esi
setb al # eax=1 if *a was Below(unsigned) *b, else 0
ret
Those are all single-uop instructions on Intel and AMD CPUs since K8 and Core2 (http://agner.org/optimize/).
Having to bswap both operands has an extra code-size cost vs. the == 0
case: we can't fold one of the loads into a memory operand for cmp
. (That saves code size, and uops thanks to micro-fusion.) This is on top the two extra bswap
instructions.
On CPUs that support movbe
, it can save code size: movbe ecx, [rsi]
is a load + bswap. On Haswell, it's 2 uops, so presumably it decodes to the same uops as mov ecx, [rsi]
/ bswap ecx
. On Atom/Silvermont, it's handled right in the load ports, so it's fewer uops as well as smaller code-size.
See the setcc
part of my xor-zeroing answer for more about why xor/cmp/setcc (which clang uses) is better than cmp/setcc/movzx (typical for gcc).
In the usual case where this inlines into code that branches on the result, the setcc + zero-extend are replaced with a jcc; the compiler optimizes away creating a boolean return value in a register. This is yet another advantage of inlining: the library memcmp
does have to create an integer boolean return value which the caller tests, because no x86 ABI/calling convention allows for returning boolean conditions in flags. (I don't know of any non-x86 calling conventions that do that either). For most library memcmp
implementations, there's also significant overhead in choosing a strategy depending on length, and maybe alignment checking. That can be pretty cheap, but for size 4 it's going to be more than the cost of all the real work.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With