What I understand is, there are two types of instruction fusions:
Micro-operations are those operations that can be executed in 1 clock cycle. If several micro-operations are fused, we obtain an "instruction".
If several instructions are fused, we obtain a Macro-operation.
If several macro-operations are fused, we obtain Macro-operation fusing.
Am I correct?
No, fusion is totally separate from how one complex instruction (like `cpuid` or `lock add [mem], eax`) can decode to multiple uops.
The way the retirement stage figures out that all the uops for a single instruction have retired, and thus the instruction has retired, has nothing to do with fusion.
Macro-fusion decodes cmp/jcc or test/jcc into a single compare-and-branch uop (Intel and AMD CPUs). The rest of the pipeline sees it purely as a single uop¹ (except performance counters still count it as 2 instructions). This saves uop-cache space, and bandwidth everywhere including decode. In some code, compare-and-branch makes up a significant fraction of the total instruction mix, maybe 25%, so choosing to look for this fusion rather than other possible fusions like `mov dst,src1` / `or dst,src2` makes sense.
Sandybridge-family can also macro-fuse some other ALU instructions with conditional branches, like `add`/`sub` or `inc`/`dec` + JCC, with some conditions. (See x86_64 - Assembly - loop conditions and out of order.)
Micro-fusion stores 2 uops from the same instruction together so they only take up 1 "slot" in the fused-domain parts of the pipeline. But they still have to dispatch separately to separate execution units. And in Intel Sandybridge-family, the RS (Reservation Station aka scheduler) is in the unfused domain, so they're even stored separately in the scheduler. (See Footnote 2 in my answer on Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths.)
P6 family had a fused-domain RS, as well as ROB, so micro-fusion helped increase the effective size of the out-of-order window there. But SnB-family reportedly simplified the uop format making it more compact, allowing larger RS sizes that are helpful all the time, not just for micro-fused instructions.
And Sandybridge family will "un-laminate" indexed addressing modes under some conditions, splitting them back into 2 separate uops in their own slots before issue/rename into the ROB in the out-of-order back end, so you lose the front-end issue/rename throughput benefit of micro-fusion. See Micro fusion and addressing modes
```nasm
cmp   [rdi], eax
jnz   .target
```
The cmp/jcc can macro-fuse into a single cmp-and-branch ALU uop, and the load from `[rdi]` can micro-fuse with that uop. Failure to micro-fuse the `cmp` does not prevent macro-fusion.
The limitations here are: RIP-relative + immediate can never micro-fuse, so `cmp dword [static_data], 1` / `jnz` can macro-fuse but not micro-fuse.
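A sketch of that rule (the symbol and label names are hypothetical):

```nasm
; RIP-relative addressing mode *plus* an immediate operand: the load
; can never stay micro-fused with the cmp, but cmp/jnz still macro-fuses.
    cmp   dword [rel static_data], 1
    jnz   .target

; with a register instead of the immediate, micro-fusion is possible:
    cmp   [rel static_data], eax
    jnz   .target
```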
A `cmp`/`jcc` on SnB-family (like `cmp [rdi+rax], edx` / `jnz`) will macro- and micro-fuse in the decoders, but the micro-fusion will un-laminate before the issue stage. (So it's 2 total uops in both the fused domain and the unfused domain: a load with an indexed addressing mode, and an ALU `cmp/jnz`.) You can verify this with perf counters by putting a `mov ecx, 1` in between the CMP and JCC vs. after them, and noting that `uops_issued.any:u` and `uops_executed.thread` both go up by 1 per loop iteration because we defeated macro-fusion. And micro-fusion behaved the same.
On Skylake, `cmp dword [rdi], 0` / `jnz` can't macro-fuse (only micro-fuse). I tested with a loop that contained some dummy `mov ecx, 1` instructions. Reordering so one of those `mov` instructions split up the `cmp/jcc` didn't change the perf counters for fused-domain or unfused-domain uops.
But `cmp [rdi], eax` / `jnz` does macro- and micro-fuse. Reordering so a `mov ecx, 1` instruction separates the CMP from the JNZ does change the perf counters (proving macro-fusion), and uops_executed is higher than uops_issued by 1 per iteration (proving micro-fusion).
`cmp [rdi+rax], eax` / `jne` only macro-fuses, not micro. (Well, actually it micro-fuses in decode but un-laminates before issue because of the indexed addressing mode, and it's not an RMW-register destination like `sub eax, [rdi+rax]` that can keep indexed addressing modes micro-fused. That `sub` with an indexed addressing mode does macro- and micro-fuse on SKL, and presumably Haswell.)
(The `cmp dword [rdi], 0` does micro-fuse, though: `uops_issued.any:u` is 1 lower than `uops_executed.thread`, and the loop contains no `nop` or other "eliminated" instructions, or any other memory instructions that could micro-fuse.)
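A minimal harness for these perf-counter experiments might look like the following (a sketch: the `cmp [rdi], eax` / `jnz` pair under test matches the text, but the scaffolding, iteration count, and memory setup are my assumptions):

```nasm
; nasm -felf64 fuse_test.asm && ld -o fuse_test fuse_test.o
; perf stat -e instructions,uops_issued.any,uops_executed.thread ./fuse_test
global _start
_start:
    sub   rsp, 8
    mov   dword [rsp], 1      ; memory operand for the cmp
    mov   rdi, rsp
    mov   eax, 1              ; [rdi] == eax, so the jnz falls through
    mov   edx, 100000000      ; iteration count
.loop:
    cmp   [rdi], eax          ; pair under test
    jnz   .done               ; not taken; move the mov below between
    mov   ecx, 1              ;   cmp and jnz to defeat macro-fusion
    dec   edx                 ;   and compare the counter deltas
    jnz   .loop
.done:
    mov   eax, 60             ; exit(0)
    xor   edi, edi
    syscall
```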
Some compilers (including GCC IIRC) prefer to use a separate load instruction and then compare+branch on a register. TODO: check whether gcc and clang's choices are optimal with immediate vs. register.
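That alternative code-gen strategy looks like this (a sketch; register choices are arbitrary):

```nasm
; memory-operand compare: can macro-fuse with jne and micro-fuse the load
    cmp   [rdi], eax
    jne   .target

; separate load, then register compare+branch (what some compilers emit):
; the mov is one uop, and the cmp/jne pair can still macro-fuse
    mov   ecx, [rdi]
    cmp   ecx, eax
    jne   .target
```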
> Micro-operations are those operations that can be executed in 1 clock cycle.
Not exactly. They take 1 "slot" in the pipeline, or in the ROB and RS that track them in the out-of-order back-end.
And yes, dispatching a uop to an execution port happens in 1 clock cycle, and simple uops (e.g. integer addition) can complete execution in the same cycle. Up to 8 uops can dispatch simultaneously since Haswell, increased to 10 on Sunny Cove. The actual execution might take more than 1 clock cycle (occupying the execution unit for longer, e.g. FP division).
The divider is, I think, the only execution unit on modern mainstream Intel that's not fully pipelined, but Knights Landing has some not-fully-pipelined SIMD shuffles that are single uop but have a (reciprocal) throughput of 2 cycles.
Footnote 1:
If `cmp [rdi], eax` / `jne` faults on the memory operand, i.e. takes a #PF exception, it's taken with the exception return address pointing to before the `cmp`. So I think even exception handling can still treat it as a single thing.
Or if the branch target address is bogus, a #PF exception will happen after the branch has already executed, from code fetch with an updated RIP. So again, I don't think there's a way for the `cmp` to execute successfully and the `jcc` to fault, requiring an exception to be taken with RIP pointing at the JCC.
But even if that case is a possibility the CPU needs to be designed to handle, sorting that out can be deferred until the exception is actually detected. Maybe with a microcode assist, or some special-case hardware.
As far as how the cmp/jcc uop goes through the pipeline in the normal case, it works exactly like one long single-uop instruction that both sets flags and conditionally branches.
Surprisingly, the `loop` instruction (like `dec rcx/jnz` but without setting flags) is not a single uop on Intel CPUs. (See Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?)