While studying assembly and processors, one thing puzzles me: how is an instruction like this executed?
add mem, 1
To my mind, the processor cannot load the memory value and perform the arithmetic operation within the same instruction, so I figure it takes place like this:
mov reg, mem
add reg, 1
mov mem, reg
If I consider a processor with a classic RISC pipeline, we can observe some stalls. That's surprising for an instruction as simple as i++:
mov reg, mem : | Fetch | Decode | Exec | Memory | WriteB |
add reg, 1   :         | Fetch | stall | stall | Decode | Exec | Memory | WriteB |
mov mem, reg :                 | Fetch | stall | stall | stall | Decode | Exec | Memory | WriteB |
(As I read in Hennessy and Patterson's book Computer Architecture: A Quantitative Approach, registers are read in the Decode stage, stores/loads happen in the Memory stage, and we allow ourselves to pick up the value of a register at the Memory stage.)
Am I right, or do modern processors have specific methods to do this more efficiently?
You're right, a modern x86 will decode add dword [mem], 1 to 3 uops: a load, an ALU add, and a store. (This is actually a simplification of various things, including Intel's micro-fusion and how AMD always keeps a load+ALU uop together in some parts of the pipeline...)
Those 3 dependent operations can't happen at the same time because the later ones have to wait for the result of the earlier one.
But execution of independent instructions can overlap, and modern CPUs very aggressively look for and exploit "instruction level parallelism" to run your code faster than 1 uop per clock. See this answer for an intro to what a single CPU core can do in parallel, with links to more stuff, like Agner Fog's x86 microarch guide, and David Kanter's write-ups of Sandybridge and Bulldozer.
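For instance (a minimal sketch of my own, not from the linked material): two dependency chains that share no registers can execute in the same cycles on different ALU ports, so this runs at 2 adds per clock instead of 1:
add eax, 1      ; chain 1
add ebx, 2      ; chain 2: independent of chain 1, can execute the same cycle
add eax, 3      ; depends only on the first add
add ebx, 4      ; depends only on the second add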
But if you look at Intel's P6 and Sandybridge microarchitecture families, a store is actually split into separate store-address and store-data uops. The store-address uop has no dependency on the load or ALU uop, and can write the store address into the store buffer at any time. (Intel's optimization manual calls it the Memory Order Buffer.)
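As a hypothetical illustration (mine, not the answer's): in the snippet below the store-address uop depends only on RDI, so it can write its address into the store buffer while the imul that produces the store data is still in flight:
imul ebx, ecx, 12345   ; multi-cycle computation of the store data
mov  [rdi], ebx        ; store-address uop needs only RDI and can run early;
                       ; the store-data uop waits for the imul result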
To increase front-end throughput, store-address and store-data uops can decode as a micro-fused pair. For add, so can the load+ALU pair, so an Intel CPU can decode add dword [rdi], 1 to 2 fused-domain uops. (The same load+add micro-fusion works for decoding add eax, [rdi] to a single uop, so any of the "simple" decoders can decode it, not just the "complex" decoder that can handle multi-uop instructions. This reduces front-end bottlenecks.)
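Putting the fused-domain counts from above side by side (a summary sketch; counts are for Intel SnB-family as described in this answer):
add eax, [rdi]       ; 1 fused-domain uop: micro-fused load+add
add dword [rdi], 1   ; 2 fused-domain uops: load+add and store-address+store-data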
This is why add [mem], 1 is more efficient than inc [mem] on Intel CPUs, even though inc reg is just as efficient as (and smaller than) add reg, 1. (inc can't micro-fuse its load+inc, because inc sets flags differently than add.) See INC instruction vs ADD 1: Does it matter?
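A quick comparison, assuming the uop counts described above (my sketch, not part of the linked Q&A):
inc dword [rdi]      ; 3 fused-domain uops on Intel: the load can't fuse with inc
add dword [rdi], 1   ; 2 fused-domain uops: micro-fused load+add, micro-fused store
inc eax              ; 1 uop, smaller encoding
add eax, 1           ; 1 uop, same speed but a larger encoding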
But this is just helping the front-end get uops into the scheduler more quickly; the load still has to run separately from the add.
But a micro-fused load doesn't have to wait for the rest of the whole instruction's inputs to be ready. Consider an instruction like add [rdi], eax, where RDI and EAX are both inputs to the instruction, but EAX isn't needed until the ALU add uop. The load can execute as soon as the load address is ready and there's a free load execution unit (AGU + cache access). See also How are x86 uops scheduled, exactly?.
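A hypothetical timing example (assuming RDI is ready early and EAX is produced by a longer-latency chain):
imul eax, eax, 12345   ; longer-latency chain producing EAX
add  [rdi], eax        ; the micro-fused load part can execute while the imul
                       ; is still running; only the ALU add uop waits for EAX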
"registers are read in the Decode stage, stores/loads happen in the Memory stage, and we allow ourselves to pick up the value of a register at the Memory stage"
All current x86 microarchitectures use out-of-order execution with register renaming (Tomasulo's algorithm). Instructions are renamed and issued into the out-of-order part of the core (ROB and scheduler).
The physical register file isn't read until an instruction is "dispatched" from the scheduler to an execution unit. (Or for recently-generated inputs, forwarded from other uops.)
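To make renaming concrete (a sketch of the effect, not of any specific microarchitecture): both blocks below write EAX, but renaming gives each write its own physical register, so the second chain doesn't wait for the first despite the architectural-register reuse:
mov eax, [rdi]   ; renamed to physical register P1 (name is illustrative)
add eax, 1
mov [rsi], eax
mov eax, [rdx]   ; renamed to a different physical register P2;
add eax, 2       ; this chain is independent and can overlap with the first
mov [rcx], eax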
Independent instructions can overlap their execution. For example, a Skylake CPU can sustain a throughput of 4 fused-domain / 7 unfused-domain uops per clock, including 2 loads + 1 store, in a carefully crafted loop:
.loop:                ; HSW: 1.12c / iter. SKL: 1.0001c
add edx, [rsp]        ; 1 fused-domain uop: micro-fused load+add
mov [rax], edi        ; 1 fused-domain uop: micro-fused store-address+store-data
blsi ebx, [rdi]       ; 1 fused-domain uop: micro-fused load+bit-manip
dec ecx
jnz .loop             ; 1 fused-domain uop: macro-fused dec+branch, runs on port 6
Sandybridge-family CPUs have an L1d cache capable of 2 reads + 1 write per clock. (Before Haswell, only 256-bit vectors could work around the AGU throughput limit, though. See How can cache be that fast?.)
Sandybridge-family front-end throughput is 4 fused-domain uops per clock, and they have lots of execution units in the back-end to handle various instruction mixes. (Haswell and later have 4 integer ALUs, 2 load ports, a store-data port, and a dedicated store-AGU for simple store addressing modes. So they can often "catch up" quickly after a cache-miss stalls execution, quickly making room in the out-of-order window to find more work to do.)