In x86-64 asm: is there a way of optimising two adjacent 32-bit stores / writes to memory if the source operands are two immediate values?

Is there a good way of optimising this code (x86-64)?

mov dword ptr[rsp], 0;
mov dword ptr[rsp+4], 0

where the immediate values could be any values, not necessarily zero, but in this instance always immediate constants.

Is the original pair of stores even slow? Might write-combining in the hardware and parallel execution of the µops just make everything ridiculously fast anyway? I’m wondering whether there is even a problem to fix.

I’m thinking of something like the following (I don’t know if these instructions even exist):

mov  qword ptr[rsp], 0

or

mov  eax, 0;
mov  qword ptr[rsp], rax    ; assuming we can spare a register, a bad idea to corrupt one though
asked Jul 19 '20 by Cecil Ward


2 Answers

Yes, compile-time / asm-source-level write-coalescing is often a good idea, especially if both halves are zero, or the high half is 0 (or -1, matching the sign bit of the low half), so you can store the whole qword with one instruction; modern x86 CPUs have efficient unaligned stores, especially when the store doesn't cross a cache-line boundary.

You generally want to minimize total fused-domain uops (to get your code through the front-end as efficiently as possible), total code-size in bytes, and total unfused-domain uops (back-end space in the scheduler / RS), in roughly that order of priority. There are also uop-cache considerations on Sandybridge-family: 64-bit immediates, or a 32-bit immediate plus a disp8/disp32, can need to borrow extra space from an adjacent entry in the uop cache line. (See Agner Fog's microarch PDF at https://agner.org/optimize/, the Sandybridge chapter; this still applies to later uarches such as Skylake.)

Also, minimizing pressure on any back-end execution ports that the surrounding code uses heavily is good. Ice Lake has 2 store-data and 2 store-address ports, so it can run both stores in parallel, but before that all x86 CPUs are limited to 1 store per clock (having only a single store-data port to write data into the store buffer). Commit from the store buffer to L1d cache is also limited to 1 store per clock. Out-of-order exec does smooth that out, so 2 stores back to back aren't a big problem, but 2x 4-byte immediate stores take a lot of instruction bytes.

x86 unfortunately doesn't have a mov r/m32, sign_extended_imm8, only imm32. (https://www.felixcloutier.com/x86/mov) x86-64 does have mov r/m64, sign_extended_imm32, though, which is what you should use:

mov    qword [rsp], 0      ; 8 bytes, 1 fused-domain uop on modern Intel and AMD CPUs

vs. 7 bytes + 8 bytes and 2 uops for mov dword [rsp],0 / mov dword [rsp+4], 0. xor-zeroing EAX / storing RAX would be smaller (code size), but cost 2 uops instead of 1.

assuming we can spare a register, a bad idea to corrupt one though

Hardly; you often have a use for a zeroed register, and xor-zeroing is literally as cheap as a NOP on Sandybridge-family. (And also cheap on AMD.) If you can do this store somewhere that makes it useful to have a zeroed register, this is very cheap:

xor   eax, eax
mov  [rsp], rax         ; better if you have a use for RAX later

Or for non-zero 64-bit values where you want mov r64, imm64, it's typical that you have a spare register you could use as a scratch destination. If you would have to spill a register or save/restore an extra reg around the whole function, then it's probably better to just do 2 separate dword-immediate stores if you can't do a single sign-extended-imm32.


For non-zero constants, if the whole qword constant can be represented as a sign-extended 32-bit immediate, use mov qword [rsp], imm32. (Or push imm32 and optimize away an earlier sub rsp, 8.)
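
For example, a minimal sketch with an arbitrary constant chosen for illustration (NASM-style syntax like the snippets above):

mov  qword [rsp], 0x12345678       ; 8 bytes, 1 fused-domain uop; imm32 sign-extends, writing 0x12345678 / 0 into the two dword halves
push 0x12345678                    ; 5 bytes; alternative if you can fold away an earlier sub rsp, 8 (push also adjusts RSP by 8)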

If you know your qword memory location is 8-byte aligned, then it's worth combining even for an arbitrary 8-byte constant that doesn't fit in a 32-bit immediate:

mov  rax, 0x123456789abcdef0        ; 10 bytes, 1 uop
mov  [rsp], rax                     ; 4 bytes, 1 micro-fused uop, for port 4 + port 2,3, or 7

It's only somewhat better than doing 2 separate dword stores, and could be slower in the (probably?) rare case where it crosses a 64-byte cache-line boundary:

mov  dword [rsp], 0x9abcdef0        ; 7 bytes, 1 micro-fused uop for port 4 + port 2,3, or 7
mov  dword [rsp+4], 0x12345678      ; 8 bytes, 1 micro-fused uop for port 4 + port 2,3, or 7

Or if your constant happens to fit in a 32-bit value zero-extended to 64-bit, but not sign-extended, you can mov eax, 0x87654321 (5 bytes, very efficient) / mov [rsp], rax.
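
Spelled out, that zero-extended case looks like this (the constant is just an example):

mov  eax, 0x87654321     ; 5 bytes; writing EAX zero-extends into RAX, so RAX = 0x0000000087654321
mov  [rsp], rax          ; 4 bytes, 1 micro-fused uop; stores the full qword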


If you want to do a qword reload later, definitely do a single qword store so store-forwarding can work efficiently.
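
For instance (a hedged sketch; narrow stores followed by a wider reload typically cause a store-forwarding stall on most uarches):

mov  qword [rsp], 0        ; single qword store ...
mov  rcx, [rsp]            ; ... so this qword reload can be forwarded from one store-buffer entry

; whereas 2 dword stores followed by a qword reload usually can't be forwarded
; from a single entry and take the slow path:
; mov  dword [rsp], 0
; mov  dword [rsp+4], 0
; mov  rcx, [rsp]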


Write-combining in the hardware

That's not the major factor. More important are out-of-order exec and the store buffer, which decouple store execution from the surrounding code.

If you were actually hoping to get more than 1 store (of any width) per clock executed, you're definitely out of luck on uarches before Ice Lake. On any uarch (even non-x86), hardware store-coalescing happens after stores execute.

You're also out of luck if you're hoping it will coalesce and take fewer entries in the store buffer, so it has more time / room to absorb two cache-miss stores. We don't have any real evidence of any x86 doing that to save store-buffer-drain bandwidth, or to free up store-buffer entries sooner. See Why doesn't RFO after retirement break memory ordering? for my current understanding of the (lack of) store-buffer coalescing. There's some evidence that Intel, at least, can commit cache-miss stores to LFBs to free up space in the store buffer, but only within the limits of program order, and there's no evidence of committing multiple stores per clock.

answered Dec 02 '22 by Peter Cordes


Yes, you can combine your two 32-bit writes into a single 64-bit write, like so:

mov     QWORD PTR [rsp], 0

The immediate value is a sign-extended 32-bit immediate, so it's not this simple if your second write is non-zero¹, or if the MSB of the first write is 1. In that case, you can load a 64-bit constant using movabs and write that. E.g., to write 1 and 2,

movabs  rax,  0x200000001
mov     QWORD PTR [rsp], rax

The constant 0x200000001 results in the right values being written into each 32-bit half.

This trick is definitely worth it for the zero case, and maybe worth it for the non-zero case; Peter's answer goes into much more detail on the tradeoffs in that latter case.

Compilers can also make this optimization (they call it "store combining" or something like that), so you can play with this on godbolt.
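
As an illustration (not from the question), for a C fragment along the lines of p[0] = 0; p[1] = 0; with p an int*, recent GCC/Clang at -O2 typically merge the two dword stores into something like:

mov     QWORD PTR [rdi], 0      ; assuming p is passed in RDI per the SysV x86-64 calling convention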


¹ Except in the special case where the sign extension gives you exactly what you want, i.e. the second value is exactly 0xFFFFFFFF and the high bit of the first value is set.
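
A sketch of that special case (values chosen for illustration; the immediate is written as a negative number so the assembler encodes it as a sign-extended imm32):

mov     QWORD PTR [rsp], -0x7FFFFFFF    ; imm32 0x80000001 sign-extends to 0xFFFFFFFF80000001,
                                        ; i.e. low dword = 0x80000001, high dword = 0xFFFFFFFF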

answered Dec 02 '22 by BeeOnRope