What should optimized compiled code for copying 3 bytes from one place to another, say, using memcpy(,,3)
look like, in terms of assembly instructions?
Consider the following program:
#include <string.h>
int main() {
    int* p = (int*) 0x10;
    int x = 0;
    memcpy(&x, p, 4);
    x = x * (x > 1 ? 2 : 3);
    memcpy(p, &x, 4);
    return 0;
}
It's a bit contrived and will cause a segmentation violation, but I need those instructions so that compiling with -O3
doesn't make all of it go away. When I compile this (GodBolt, GCC 6.3 -O3), I get:
main:
mov edx, DWORD PTR ds:16
xor eax, eax
cmp edx, 1
setle al
add eax, 2
imul eax, edx
mov DWORD PTR ds:16, eax
xor eax, eax
ret
Great - a single mov of a DWORD (= 4 bytes) from memory to a register. Nice and optimized. Now let's change the memcpy(&x, p, 4) into memcpy(&x, p, 3). The compilation result becomes:
main:
mov DWORD PTR [rsp-4], 0
movzx eax, WORD PTR ds:16
mov WORD PTR [rsp-4], ax
movzx eax, BYTE PTR ds:18
mov BYTE PTR [rsp-2], al
mov edx, DWORD PTR [rsp-4]
xor eax, eax
cmp edx, 1
setle al
add eax, 2
imul eax, edx
mov DWORD PTR ds:16, eax
xor eax, eax
ret
I'm not much of an expert on Intel x86_64 assembly (read: I can't even read it properly when it's complicated), so I don't quite get this. I mean, I get what's happening in the first 6 instructions and why so many of them are necessary. Why aren't two moves sufficient? A mov WORD PTR into al and a mov BYTE PTR into ah?
... so, I came here to ask. As I was writing the question I noticed GodBolt also has clang as an option. Well, clang (3.9.0 -O3) does this:
main: # @main
movzx eax, byte ptr [18]
shl eax, 16
movzx ecx, word ptr [16]
or ecx, eax
cmp ecx, 2
sbb eax, eax
and eax, 1
or eax, 2
imul eax, ecx
mov dword ptr [16], eax
xor eax, eax
ret
which looks more like what I expected. What explains the difference?
Notes: The x = 0 initialization is what forces all of these mov's. The behavior is similar if we forego memcpy'ing for the following:
#include <string.h>
typedef struct {
    unsigned char data[3];
} uint24_t;

int main() {
    uint24_t* p = (uint24_t*) 0x30;
    int x = 0;
    *((uint24_t*) &x) = *p;
    x = x * (x > 1 ? 2 : 3);
    *p = *((uint24_t*) &x);
    return 0;
}
If you want to contrast with what happens when the relevant code is in a function, have a look at this or the uint24_t struct version (GodBolt). Then have a look at what happens for 4-byte values.
You should get much better code from copying 4 bytes and masking off the top one, e.g. with x & 0x00ffffff
. That lets the compiler know it's allowed to read 4 bytes, not just the 3 that the C source reads.
Yes, that helps a ton: it saves gcc and clang from storing a 4B zero, then copying three bytes and reloading 4. They just load 4, mask, store, and use the value that's still in the register. Part of that might be from not knowing if *p aliases *q.
int foo(int *p, int *q) {
    //*p = 0;
    //memcpy(p, q, 3);
    *p = (*q) & 0x00ffffff;
    return *p;
}
mov eax, DWORD PTR [rsi] # load
and eax, 16777215 # mask
mov DWORD PTR [rdi], eax # store
ret # and leave it in eax as return value
Why aren't two moves sufficient? A mov WORD PTR into al followed by a mov BYTE PTR into ah?
AL and AH are 8-bit registers. You can't put a 16-bit word into AL. This is why your last clang-output block loads two separate registers and merges with a shift+or, in the case where it knows it's allowed to mess with all 4 bytes of x.
If you were merging two separate one-byte values, you could load them to AL and AH and then use AX, but that leads to a partial-register stall on Intel pre-Haswell.
You could do a word-load into AX (or preferably movzx into eax for various reasons including correctness and avoiding a false dependency on the old value of EAX), left-shift EAX, and then a byte-load into AL.
But compilers don't tend to do that, because partial-register stuff has been very bad juju for many years, and is only efficient on recent CPUs (Haswell, and maybe IvyBridge). It would cause serious stalls on Nehalem and Core2. (See Agner Fog's microarch pdf; search for partial-register or look for it in the index. See other links in the x86 tag wiki.) Maybe in a few years, -mtune=haswell
will enable partial-register tricks to save the OR instruction that clang uses to merge.
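For concreteness, here is a minimal sketch of that partial-register trick (purely illustrative, not actual compiler output; assume the source pointer is in rsi, and note the word is loaded from offset 1 so the bytes land in the right order):

movzx eax, word ptr [rsi + 1]  # load bytes 1 and 2, zero-extended into EAX
shl eax, 8                     # shift them up to make room for the low byte
mov al, byte ptr [rsi]         # merge byte 0 into AL: a partial-register write

This builds the 3-byte value in EAX without the OR, but the later read of the full EAX (e.g. by the imul) is exactly what triggers the partial-register penalty on the older CPUs mentioned above.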
Write functions that take args and return a value so you don't have to make them super-weird to keep them from optimizing away, e.g. a function that takes two int* args and does a 3-byte memcpy between them. Here it is on Godbolt (with gcc and clang), with colour highlighting:
void copy3(int *p, int *q) { memcpy(p, q, 3); }
clang 3.9 -O3 does exactly what you expected: a byte and a word copy.
mov al, byte ptr [rsi + 2]
mov byte ptr [rdi + 2], al
movzx eax, word ptr [rsi]
mov word ptr [rdi], ax
ret
To get the silliness you managed to generate, zero the destination first, and then read it back after a three-byte copy:
int foo(int *p, int *q) {
    *p = 0;
    memcpy(p, q, 3);
    return *p;
}
clang 3.9 -O3:
mov dword ptr [rdi], 0 # *p = 0
mov al, byte ptr [rsi + 2]
mov byte ptr [rdi + 2], al # byte copy
movzx eax, word ptr [rsi]
mov word ptr [rdi], ax # word copy
mov eax, dword ptr [rdi] # read the whole thing, causing a store-forwarding stall
ret
gcc doesn't do any better (except on CPUs that don't rename partial regs, since it avoids a false dependency on the old value of EAX by using movzx
for the byte copy as well).
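To illustrate that parenthetical, here is the byte copy written both ways (a sketch of just that one step, not full compiler output):

# clang's form: writes only AL, leaving a false dependency on the old value of RAX
# on CPUs that don't rename partial registers
mov al, byte ptr [rsi + 2]
mov byte ptr [rdi + 2], al

# gcc's form: movzx writes all of EAX, breaking that dependency
movzx eax, byte ptr [rsi + 2]
mov byte ptr [rdi + 2], al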
Three is an ugly size, and compilers are not perfect.
The compiler cannot generate an access to a memory location you haven't requested, so it needs two loads (a word and a byte) rather than a single DWORD load.
While it may seem trivial to you, remember that you asked for memcpy(&x, p, 3), which is a copy from memory to memory.
Evidently GCC and older versions of Clang are not smart enough to figure out that there is no reason to go through a temporary in memory.
What GCC is doing with the first six instructions is basically constructing a DWORD at [rsp-4]
with the three bytes, as you requested:
mov DWORD PTR [rsp-4], 0 ;DWORD is 0
movzx eax, WORD PTR ds:16 ;EAX = byte 0 and byte 1
mov WORD PTR [rsp-4], ax ;DWORD has byte 0 and byte 1
movzx eax, BYTE PTR ds:18 ;EAX = byte 2
mov BYTE PTR [rsp-2], al ;DWORD has byte 0, byte 1 and byte 2
mov edx, DWORD PTR [rsp-4]    ;Same as before from here on
It is using movzx eax, ...
to prevent a partial register stall.
The compilers already did a great job by eliding the call to memcpy, and, as you said, the example is "a bit contrived" and hard to follow, even for a human.
The memcpy optimisations must work for any size, including sizes that don't fit in a register. It's not easy to get it right every time.
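For comparison, a size that does fit in a register is trivial: just as the 4-byte case at the top became a single DWORD mov, an 8-byte copy between two pointers would be a plain load/store pair (a sketch, assuming the same rdi/rsi argument registers as above):

mov rax, qword ptr [rsi]    ;load all 8 bytes at once
mov qword ptr [rdi], rax    ;store them
;whereas 3 bytes has to be split into a word access and a byte access

It's the awkward sizes like 3 that force the compiler to stitch several accesses together.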
Considering that L1 access latencies have been lowered considerably in recent architectures and that [rsp-4] is very likely to be in the cache, I'm not sure it's worth messing with the optimisation code in the GCC source.
It is surely worth filing a bug for the missed optimisation and seeing what the developers have to say.