What should optimized compiled code for copying 3 bytes from one place to another, say, using memcpy(,,3)
look like, in terms of assembly instructions?
Consider the following program:
#include <string.h>
int main() {
    int* p = (int*) 0x10;
    int x = 0;
    memcpy(&x, p, 4);
    x = x * (x > 1 ? 2 : 3);
    memcpy(p, &x, 4);
    return 0;
}
It's a bit contrived and will cause a segmentation violation, but I need those instructions so that compiling with -O3
doesn't make all of it go away. When I compile this (GodBolt, GCC 6.3 -O3), I get:
main:
mov edx, DWORD PTR ds:16
xor eax, eax
cmp edx, 1
setle al
add eax, 2
imul eax, edx
mov DWORD PTR ds:16, eax
xor eax, eax
ret
Great - a single mov of a DWORD (= 4 bytes) from memory to a register. Nice and optimized. Now let's change the memcpy(&x, p, 4) into memcpy(&x, p, 3). The compilation result becomes:
main:
mov DWORD PTR [rsp-4], 0
movzx eax, WORD PTR ds:16
mov WORD PTR [rsp-4], ax
movzx eax, BYTE PTR ds:18
mov BYTE PTR [rsp-2], al
mov edx, DWORD PTR [rsp-4]
xor eax, eax
cmp edx, 1
setle al
add eax, 2
imul eax, edx
mov DWORD PTR ds:16, eax
xor eax, eax
ret
I'm not much of an expert on Intel x86_64 assembly (read: I can't even read it properly when it's complicated), so I don't quite get this. I mean, I get what's happening in the first 6 instructions and why so many of them are necessary. Why aren't two moves sufficient? A mov WORD PTR into al and a mov BYTE PTR into ah?
... so, I came here to ask. As I was writing the question I noticed GodBolt also has clang as an option. Well, clang (3.9.0 -O3) does this:
main: # @main
movzx eax, byte ptr [18]
shl eax, 16
movzx ecx, word ptr [16]
or ecx, eax
cmp ecx, 2
sbb eax, eax
and eax, 1
or eax, 2
imul eax, ecx
mov dword ptr [16], eax
xor eax, eax
ret
which looks more like what I expected. What explains the difference?
Notes: The x = 0 initialization is what forces all of these mov's. The behavior is similar if we forego memcpy'ing for the following:
#include <string.h>
typedef struct {
    unsigned char data[3];
} uint24_t;

int main() {
    uint24_t* p = (uint24_t*) 0x30;
    int x = 0;
    *((uint24_t*) &x) = *p;
    x = x * (x > 1 ? 2 : 3);
    *p = *((uint24_t*) &x);
    return 0;
}
If you want to contrast with what happens when the relevant code is in a function, have a look at this or the uint24_t struct version (GodBolt). Then have a look at what happens for 4-byte values.
You should get much better code from copying 4 bytes and masking off the top one, e.g. with x & 0x00ffffff
. That lets the compiler know it's allowed to read 4 bytes, not just the 3 that the C source reads.
Yes, that helps a ton: it saves gcc and clang from storing a 4B zero, then copying three bytes and reloading 4. They just load 4, mask, store, and use the value that's still in the register. Part of that might be from not knowing if *p aliases *q.
int foo(int *p, int *q) {
    //*p = 0;
    //memcpy(p, q, 3);
    *p = (*q) & 0x00ffffff;
    return *p;
}
mov eax, DWORD PTR [rsi] # load
and eax, 16777215 # mask
mov DWORD PTR [rdi], eax # store
ret # and leave it in eax as return value
Why aren't two moves sufficient? A mov WORD PTR into al followed by a mov BYTE PTR into ah?
AL and AH are 8-bit registers. You can't put a 16-bit word into AL. This is why your last clang-output block loads two separate registers and merges with a shift+or, in the case where it knows it's allowed to mess with all 4 bytes of x.
If you were merging two separate one-byte values, you could load them to AL and AH and then use AX, but that leads to a partial-register stall on Intel pre-Haswell.
You could do a word-load into AX (or preferably movzx into eax for various reasons including correctness and avoiding a false dependency on the old value of EAX), left-shift EAX, and then a byte-load into AL.
But compilers don't tend to do that, because partial-register stuff has been very bad juju for many years, and is only efficient on recent CPUs (Haswell, and maybe IvyBridge). It would cause serious stalls on Nehalem and Core2. (See Agner Fog's microarch pdf; search for partial-register or look for it in the index. See other links in the x86 tag wiki.) Maybe in a few years, -mtune=haswell
will enable partial-register tricks to save the OR instruction that clang uses to merge.
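For concreteness, here is a minimal sketch of that partial-register trick (purely illustrative, not actual compiler output; assume the source pointer is in rsi, and note the word is loaded from offset 1 so the bytes land in the right order):

movzx eax, word ptr [rsi + 1]  # load bytes 1 and 2, zero-extended into EAX
shl eax, 8                     # shift them up to make room for the low byte
mov al, byte ptr [rsi]         # merge byte 0 into AL: a partial-register write

This builds the 3-byte value in EAX without the OR, but the later read of the full EAX (e.g. by the imul) is exactly what triggers the partial-register penalty on the older CPUs mentioned above.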
Write functions that take args and return a value so you don't have to make them super-weird to keep them from optimizing away, e.g. a function that takes two int* args and does a 3-byte memcpy between them. Here it is on Godbolt (with gcc and clang), with colour highlighting:
void copy3(int *p, int *q) { memcpy(p, q, 3); }
clang 3.9 -O3 does exactly what you expected: a byte and a word copy.
mov al, byte ptr [rsi + 2]
mov byte ptr [rdi + 2], al
movzx eax, word ptr [rsi]
mov word ptr [rdi], ax
ret
To get the silliness you managed to generate, zero the destination first, and then read it back after a three-byte copy:
int foo(int *p, int *q) {
    *p = 0;
    memcpy(p, q, 3);
    return *p;
}
clang 3.9 -O3:
mov dword ptr [rdi], 0 # *p = 0
mov al, byte ptr [rsi + 2]
mov byte ptr [rdi + 2], al # byte copy
movzx eax, word ptr [rsi]
mov word ptr [rdi], ax # word copy
mov eax, dword ptr [rdi] # read the whole thing, causing a store-forwarding stall
ret
gcc doesn't do any better (except on CPUs that don't rename partial regs, since it avoids a false dependency on the old value of EAX by using movzx
for the byte copy as well).
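To illustrate that parenthetical, here is the byte copy written both ways (a sketch of just that one step, not full compiler output):

# clang's form: writes only AL, leaving a false dependency on the old value of RAX
# on CPUs that don't rename partial registers
mov al, byte ptr [rsi + 2]
mov byte ptr [rdi + 2], al

# gcc's form: movzx writes all of EAX, breaking that dependency
movzx eax, byte ptr [rsi + 2]
mov byte ptr [rdi + 2], al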
Three is an ugly size, and compilers are not perfect.
The compiler cannot generate an access to a memory location you haven't requested, so it needs two loads (a word and a byte) rather than a single DWORD load.
While it may seem trivial to you, remember that you asked for memcpy(&x, p, 3), which is a copy from memory to memory.
Evidently GCC and older versions of Clang are not smart enough to figure out that there is no reason to go through a temporary in memory.
What GCC is doing with the first six instructions is basically constructing a DWORD at [rsp-4]
with the three bytes, as you requested:
mov DWORD PTR [rsp-4], 0 ;DWORD is 0
movzx eax, WORD PTR ds:16 ;EAX = byte 0 and byte 1
mov WORD PTR [rsp-4], ax ;DWORD has byte 0 and byte 1
movzx eax, BYTE PTR ds:18 ;EAX = byte 2
mov BYTE PTR [rsp-2], al ;DWORD has byte 0, byte 1 and byte 2
mov edx, DWORD PTR [rsp-4]    ;Same as before from here on
It is using movzx eax, ...
to prevent a partial register stall.
The compilers already did a great job by eliding the call to memcpy, and, as you said, the example is "a bit contrived" and hard to follow, even for a human.
The memcpy optimisations must work for any size, including sizes that don't fit in a register. It's not easy to get it right every time.
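For comparison, a size that does fit in a register is trivial: just as the 4-byte case at the top became a single DWORD mov, an 8-byte copy between two pointers would be a plain load/store pair (a sketch, assuming the same rdi/rsi argument registers as above):

mov rax, qword ptr [rsi]    ;load all 8 bytes at once
mov qword ptr [rdi], rax    ;store them
;whereas 3 bytes has to be split into a word access and a byte access

It's the awkward sizes like 3 that force the compiler to stitch several accesses together.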
Considering that L1 access latencies have been lowered considerably in recent architectures and that [rsp-4] is very likely to be in the cache, I'm not sure it's worth messing with the optimisation code in the GCC source.
It is surely worth filing a bug for the missed optimisation and seeing what the developers have to say.