
Why is memcmp(a, b, 4) only sometimes optimized to a uint32 comparison?

Given this code:

#include <string.h>

int equal4(const char* a, const char* b)
{
    return memcmp(a, b, 4) == 0;
}

int less4(const char* a, const char* b)
{
    return memcmp(a, b, 4) < 0;
}

GCC 7 on x86_64 introduced an optimization for the first case (Clang has done it for a long time):

    mov     eax, DWORD PTR [rsi]
    cmp     DWORD PTR [rdi], eax
    sete    al
    movzx   eax, al

But the second case still calls memcmp():

    sub     rsp, 8
    mov     edx, 4
    call    memcmp
    add     rsp, 8
    shr     eax, 31

Could a similar optimization be applied to the second case? What's the best assembly for this, and is there any clear reason why it isn't being done (by GCC or Clang)?

See it on Godbolt's Compiler Explorer: https://godbolt.org/g/jv8fcf

asked Jul 12 '17 by John Zwinck



3 Answers

If you are generating code for a little-endian platform, optimizing a four-byte memcmp ordering comparison (the < 0 case in the question) down to a single DWORD comparison is invalid.

memcmp compares individual bytes from the lowest-addressed byte to the highest-addressed byte, regardless of the platform.

For memcmp to return zero, all four bytes must be identical, so the order in which the bytes are compared does not matter. That is why the DWORD optimization is valid for == 0: the sign of the result is ignored.

However, when the sign of memcmp's result matters, byte ordering matters too. Implementing the same comparison with a 32-bit DWORD comparison therefore requires a specific endianness: the platform must be big-endian, otherwise the result of the comparison would be incorrect.
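
To make the distinction concrete, here is a small sketch (not from the answer; the function names are illustrative) of the transformation the compiler is allowed to make for == 0, and why the naive version of < 0 is not equivalent on a little-endian machine:

#include <stdint.h>
#include <string.h>

/* Valid on any endianness: the four bytes are equal if and only if the two
 * 32-bit loads are equal, whatever order the bytes land in. */
int equal4_u32(const char* a, const char* b)
{
    uint32_t ua, ub;
    memcpy(&ua, a, 4);   /* avoids strict-aliasing/alignment problems */
    memcpy(&ub, b, 4);
    return ua == ub;
}

/* NOT valid on little-endian: ua < ub compares the bytes in the wrong order,
 * so it can disagree with memcmp(a, b, 4) < 0. It would only be correct if
 * the load produced a big-endian (lexicographic) value. */
int less4_u32_wrong(const char* a, const char* b)
{
    uint32_t ua, ub;
    memcpy(&ua, a, 4);
    memcpy(&ub, b, 4);
    return ua < ub;
}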

answered Nov 08 '22 by Sergey Kalinichenko


Endianness is the problem here. Consider this input:

a = 01 00 00 03
b = 02 00 00 02

Byte by byte, memcmp reports that a is smaller, because the first bytes that differ are 0x01 < 0x02. But if you compare these two arrays by loading them as 32-bit integers on a little-endian machine, you'll find that a is larger (because 0x03000001 > 0x02000002). On a big-endian machine, the integer comparison would work as expected.
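
A quick test program (a sketch added for illustration, reusing the bytes above) shows the disagreement on a little-endian machine: memcmp() ranks a below b, while the raw 32-bit comparison ranks a above b:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const unsigned char a[4] = { 0x01, 0x00, 0x00, 0x03 };
    const unsigned char b[4] = { 0x02, 0x00, 0x00, 0x02 };

    uint32_t ua, ub;
    memcpy(&ua, a, 4);
    memcpy(&ub, b, 4);

    /* On little-endian: ua = 0x03000001, ub = 0x02000002. */
    printf("memcmp(a, b, 4) = %d\n", memcmp(a, b, 4));        /* negative: a < b */
    printf("uint32 compare: a %s b\n", ua < ub ? "<" : ">=");  /* prints ">=" */
    return 0;
}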

answered Nov 08 '22 by r3mainer


As discussed in other answers/comments, using memcmp(a,b,4) < 0 is equivalent to an unsigned comparison between big-endian integers, so it can't inline as efficiently as == 0 on little-endian x86.

More importantly, the current version of this behaviour in gcc7/8 only looks for memcmp() == 0 or != 0. Even on a big-endian target where this could inline just as efficiently for < or >, gcc won't do it. (Godbolt's newest big-endian compilers are PowerPC 64 gcc6.3 and MIPS/MIPS64 gcc5.4; mips is big-endian MIPS, while mipsel is little-endian MIPS.) If testing this with a future gcc, use a = __builtin_assume_aligned(a, 4) to make sure gcc doesn't have to worry about unaligned-load performance/correctness on non-x86. (Or just use const int32_t* instead of const char*.)

If/when gcc learns to inline memcmp for cases other than EQ/NE, maybe gcc will do it on little-endian x86 when its heuristics tell it the extra code size will be worth it, e.g. in a hot loop when compiling with -fprofile-use (profile-guided optimization).


If you want compilers to do a good job for this case, you should probably assign to a uint32_t and use an endian-conversion function like ntohl. But make sure you pick one that can actually inline; apparently Windows has an ntohl that compiles to a DLL call. See other answers on that question for some portable-endian stuff, and also someone's imperfect attempt at a portable_endian.h, and this fork of it. I was working on a version for a while, but never finished/tested it or posted it.

The pointer-casting may be Undefined Behaviour, depending on how you wrote the bytes and what the char* points to. If you're not sure about strict-aliasing and/or alignment, memcpy into abytes and bbytes instead. Most compilers are good at optimizing away a small fixed-size memcpy.

// I know the question just wonders why gcc does what it does,
// not asking for how to write it differently.
// Beware of alignment performance or even fault issues outside of x86.

#include <endian.h>
#include <stdint.h>

int equal4_optim(const char* a, const char* b) {
    uint32_t abytes = *(const uint32_t*)a;
    uint32_t bbytes = *(const uint32_t*)b;

    return abytes == bbytes;
}


int less4_optim(const char* a, const char* b) {
    // be32toh gives host-order values whose unsigned < matches memcmp's byte order
    uint32_t a_native = be32toh(*(const uint32_t*)a);
    uint32_t b_native = be32toh(*(const uint32_t*)b);

    return a_native < b_native;
}

I checked on Godbolt, and that compiles to efficient code (basically identical to what I wrote in asm below), especially on big-endian platforms, even with old gcc. It also makes much better code than ICC17, which inlines memcmp but only to a byte-compare loop (even for the == 0 case).
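
For reference, here is a memcpy-based variant of the same idea (a sketch, not from the original answer; less4_memcpy is an illustrative name, and the glibc-specific be32toh from <endian.h> is assumed, with ntohl as a rough portable substitute). It sidesteps the strict-aliasing and alignment caveats mentioned earlier, and the fixed-size memcpy normally compiles to a plain load:

#include <endian.h>   /* be32toh(); glibc-specific, see the portability note above */
#include <stdint.h>
#include <string.h>

int less4_memcpy(const char* a, const char* b)
{
    uint32_t abytes, bbytes;
    memcpy(&abytes, a, 4);   /* compilers optimize these fixed-size copies away */
    memcpy(&bbytes, b, 4);
    /* compare in big-endian (memcmp) order */
    return be32toh(abytes) < be32toh(bbytes);
}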


I think this hand-crafted sequence is an optimal implementation of less4() (for the x86-64 System V calling convention, as used in the question, with const char *a in rdi and b in rsi).

less4:
    mov   edi, [rdi]
    mov   esi, [rsi]
    bswap edi
    bswap esi
    # data loaded and byte-swapped to native unsigned integers
    xor   eax,eax    # solves the same problem as gcc's movzx, see below
    cmp   edi, esi
    setb  al         # eax=1 if *a was Below(unsigned) *b, else 0
    ret

Those are all single-uop instructions on Intel and AMD CPUs since K8 and Core2 (http://agner.org/optimize/).

Having to bswap both operands has an extra code-size cost vs. the == 0 case: we can't fold one of the loads into a memory operand for cmp. (Folding a load saves code size, and saves uops thanks to micro-fusion.) That's on top of the two extra bswap instructions.

On CPUs that support movbe, it can save code size: movbe ecx, [rsi] is a load + bswap. On Haswell, it's 2 uops, so presumably it decodes to the same uops as mov ecx, [rsi] / bswap ecx. On Atom/Silvermont, it's handled right in the load ports, so it's fewer uops as well as smaller code-size.

See the setcc part of my xor-zeroing answer for more about why xor/cmp/setcc (which clang uses) is better than cmp/setcc/movzx (typical for gcc).

In the usual case where this inlines into code that branches on the result, the setcc + zero-extend are replaced with a jcc; the compiler optimizes away creating a boolean return value in a register. This is yet another advantage of inlining: the library memcmp has to create an integer return value which the caller then tests, because no x86 ABI/calling convention allows returning a condition in flags. (I don't know of any non-x86 calling conventions that do that either.) For most library memcmp implementations, there's also significant overhead in choosing a strategy depending on length, and maybe alignment checking. That can be pretty cheap, but for size 4 it's going to be more than the cost of all the real work.
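
As an illustration (a hypothetical caller; copy_if_less and sink are made-up names, not from the answer), once something like less4_optim() from above is inlined, the compiler can branch directly on the flags from cmp and never materializes the 0/1 value:

/* With less4_optim() inlined, gcc/clang typically emit load / bswap / cmp
 * followed by a conditional jump that consumes the flags directly,
 * with no setb/movzx to build a boolean in a register. */
void copy_if_less(const char* a, const char* b, void (*sink)(const char*))
{
    if (less4_optim(a, b))
        sink(a);
}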

answered Nov 08 '22 by Peter Cordes