I'm reading Computer Systems: A Programmer's Perspective, 3/E (CS:APP3e) by Randal E. Bryant and David R. O'Hallaron, and the authors say "Observe that the movl instruction of line 6 reads 4 bytes from memory; the following addb instruction only makes use of the low-order byte"
Why do they use movl on line 6? Why don't they use movb 8(%rsp), %dl?
void proc(a1, a1p, a2, a2p, a3, a3p, a4, a4p)
Arguments passed as follows:
a1 in %rdi (64 bits)
a1p in %rsi (64 bits)
a2 in %edx (32 bits)
a2p in %rcx (64 bits)
a3 in %r8w (16 bits)
a3p in %r9 (64 bits)
a4 at %rsp+8 ( 8 bits)
a4p at %rsp+16 (64 bits)
1 proc:
2   movq 16(%rsp), %rax    # Fetch a4p (64 bits)
3   addq %rdi, (%rsi)      # *a1p += a1 (64 bits)
4   addl %edx, (%rcx)      # *a2p += a2 (32 bits)
5   addw %r8w, (%r9)       # *a3p += a3 (16 bits)
6   movl 8(%rsp), %edx     # Fetch a4 (8 bits)
7   addb %dl, (%rax)       # *a4p += a4 (8 bits)
8   ret                    # Return
TL;DR: You can; GCC just chooses not to, saving 1 byte of code-size vs. a normal movzbl byte load and avoiding any partial-register penalties from a movb load+merge. And for somewhat obscure reasons (explained below), a dword load won't cause a store-forwarding stall here when loading a function arg.
(This code is exactly what we get from GCC4.8 and later with gcc -O1 for those C statements and integer types of those widths. See it and clang on the Godbolt compiler explorer; GCC -O3 schedules the movl one instruction earlier.)
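For reference, a C source consistent with those argument widths would look something like this (my reconstruction; the book's exact declaration may differ):

void proc(long a1, long *a1p,
          int a2, int *a2p,
          short a3, short *a3p,
          char a4, char *a4p)
{
    *a1p += a1;    /* 64-bit add */
    *a2p += a2;    /* 32-bit add */
    *a3p += a3;    /* 16-bit add */
    *a4p += a4;    /*  8-bit add */
}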
There's no correctness reason for doing it this way, only possible performance. You're correct that a byte load would work just as well. (I've omitted redundant operand-size suffixes because they're implied by the register operands).
mov 8(%rsp), %dl # byte load, merging into RDX
add %dl, (%rax)
What you're likely to get from a C compiler is a byte load with zero-extension (e.g. GCC4.7 and earlier do this):
movzbl 8(%rsp), %edx # byte load zero-extended into RDX
add %dl, (%rax)
movzbl (aka MOVZX in Intel syntax) is your go-to instruction for loading bytes or words, not movb or movw. It's always safe, and on modern CPUs MOVZX loads are literally as fast as dword mov loads, with no extra latency or extra uops; the zero-extension is handled right in the load execution unit. (Intel since Core 2 or earlier, AMD since at least Ryzen. https://agner.org/optimize/)
The only cost is 1 extra byte of code size (a larger opcode). movsbl or movsbq (aka MOVSX) sign-extending loads are equally efficient on more recent CPUs, but on some AMD (like some Bulldozer-family) they're 1 cycle higher latency than MOVZX loads. So prefer MOVZX if all you care about is avoiding partial-register shenanigans when loading a byte.
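For example, with %rdi as a hypothetical pointer to the data, just to illustrate the forms:

movzbl (%rdi), %eax    # zero-extend a byte into RAX (writing EAX also zeroes the upper 32 bits)
movzwl (%rdi), %ecx    # zero-extend a word into RCX
movsbl (%rdi), %edx    # sign-extend a byte to 32 bits
movsbq (%rdi), %rsi    # sign-extend a byte to the full 64 bits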
Usually only use movb or movw (with register destinations) if you specifically want to merge into the low byte or word of the existing 64-bit register. Byte/word stores are perfectly fine on x86; I'm only talking about mov mem-to-reg or reg-to-reg. There are exceptions to this rule: sometimes you can safely use byte operand-size without problems if you're careful and understand the microarchitecture(s) you care about your code running efficiently on. And beware that intentionally merging by writing a byte register and then reading a larger register can cause partial-register merging stalls on some CPUs; see the sketch below.
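A hypothetical sequence showing the pattern to watch out for:

movb  (%rdi), %al      # writes only AL, merging into the old RAX (so it depends on the old RAX value)
addq  %rax, (%rsi)     # reads the full RAX right after a partial write: can cost a merging uop or stall on some Intel CPUs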
A movb load into %dl would have a false dependency on the instruction (in your caller) that last wrote RDX, on some CPUs including current Intel and all AMD. (Why doesn't GCC use partial registers?) Clang and ICC don't care and do it anyway, implementing the function the way you expected.
movl writes the full 64-bit register (by implicit zero-extension when writing a 32-bit register), avoiding that problem.
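i.e. the difference is:

movl  8(%rsp), %edx    # writes all of RDX (upper 32 bits zeroed): no dependency on the old RDX value
movb  8(%rsp), %dl     # writes only DL, preserving the rest of RDX: depends on whatever last wrote RDX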
But reading a dword from 8(%rsp) could introduce a store-forwarding stall if the caller only used a byte store. If the caller wrote that memory with a push, you're fine. But if the caller only did movb $123, (%rsp) into already-reserved stack space before the call, your function is now reading a dword from a location where the most recent store was a single byte. Unless there was some other stall (e.g. in code fetch after calling your function), that byte is probably still in the store buffer when the load uop executes, but the load needs that byte plus 3 more from cache, or possibly from some earlier store that's also still in the store buffer. So the load has to scan the store buffer for all potential matches before merging the byte from the store buffer with the other bytes from cache. The fast path for store-forwarding only works when all the data a load needs comes from exactly one store. (Can modern x86 implementations store-forward from more than one prior store?)
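A hypothetical caller that would create that slow case (123 is just an arbitrary value):

# caller, with outgoing-arg stack space already reserved:
movb  $123, (%rsp)      # byte store of the a4 arg
call  proc              # pushing the return address puts that byte at 8(%rsp) inside proc

# inside proc:
movl  8(%rsp), %edx     # needs 1 byte from the store buffer plus 3 bytes from cache: store-forwarding stall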
clang/gcc sign- or zero-extend narrow args to 32-bit, even though the System V ABI as written doesn't (yet?) require it. Clang-generated code also depends on it. This apparently includes args passed in memory, as we can see from looking at the caller on Godbolt. (I used __attribute__((noinline)) so I could compile with optimization enabled but still not have the call inline and optimize away. Otherwise I could have just commented out the body and looked at a caller that could only see a prototype.)
This is not part of C's "default argument promotions" for calling unprototyped functions; the C types of the narrow args are still short or char. It's only a calling-convention feature that lets the callee make assumptions about bits in the register (or memory) outside of the object-representation of the C object. It would be more useful if the upper 32 bits were also required to be zero, because as it stands you still can't use narrow args directly as array indices in 64-bit addressing modes. But you can do int_arg += char_arg without a MOVSX first, so it can make code more efficient when you use narrow args and they get implicitly promoted to int by C's rules for binary operators like +. See the sketch below.
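For example, given a hypothetical callee int add_char(int x, char c) { return x + c; }, a compiler that relies on the caller having already sign-extended c to 32 bits can compile the body to something like:

addl  %edi, %esi     # x + c, using all 32 bits of ESI directly: no movsbl %sil, %esi needed first
movl  %esi, %eax
ret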
By compiling the caller with gcc -O3 -maccumulate-outgoing-args (or -O0 or -O1), I got GCC to reserve stack space with sub and then use movl $4, (%rsp) before the call proc in a function that calls yours. It would have been more efficient (smaller code-size) for gcc to use movb, but it chose a movl with a 32-bit immediate. I think this is because it's implementing that unwritten rule in the calling convention, rather than for some other reason.
More usually (without -maccumulate-outgoing-args), the caller will use push $4 or push %rdi to do a qword store of the arg, which can also store-forward efficiently to a dword (or byte) reload in the callee. So either way, the arg will have been written with at least a dword store, making a dword reload safe for performance.
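i.e. the common, efficient case looks something like:

# caller:
push  $4                # qword store of the stack arg
call  proc              # the return-address push puts the arg at 8(%rsp) inside proc

# inside proc:
movl  8(%rsp), %edx     # all 4 bytes come from that one qword push store: forwards on the fast path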
A dword mov load has 1-byte smaller code-size than a movzbl load, and avoids the possible extra cost of a MOVSX or MOVZX (on old AMD CPUs, and on extremely old Intel CPUs like P5). So I think it's optimal.
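For example, the machine-code encodings of the two candidate loads (bytes from my own assembly of these instructions):

8b 54 24 08        movl   8(%rsp), %edx     # 4 bytes
0f b6 54 24 08     movzbl 8(%rsp), %edx     # 5 bytes (extra 0f escape byte for the MOVZX opcode)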
GCC4.7 and earlier do use a movzbl (MOVZX) load for the char a4 arg like I recommended as the generally-safe option, but GCC4.8 and later use a movl.