Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Understanding CPU pipeline stages vs. Instruction throughput

I'm missing something fundamental re. CPU pipelines: at a basic level, why do instructions take differing numbers of clock cycles to complete and how come some instructions only take 1 cycle in a multi-stage CPU?

Besides the obvious of "different instructions require a different amount of work to complete", hear me out...

Consider an i7 with an approx 14 stage pipeline. That takes 14 clock cycles to complete a run-through. AFAIK, that should mean the entire pipeline has a latency of 14 clocks. Yet this isn't the case.

An XOR completes in 1 cycle and has a latency of 1 cycle, indicating it doesn't go through all 14 stages. BSR has a latency of 3 cycles, but a throughput of 1 per cycle. AAM has a latency of 20 cycles (more that the stage count) and a throughput of 8 (on an Ivy Bridge).

Some instructions cannot be issued every clock, yet take less than 14 clocks to complete.

I know about the multiple execution units. I don't understand how the length of instructions in terms of latency and throughput relate to the number of pipline stages.

like image 240
IamIC Avatar asked Oct 31 '25 19:10

IamIC


2 Answers

I think what's missing from the existing answers is the existence of "bypass" or "forwarding" datapaths. For simplicity, let's stick with the MIPS 5-stage pipeline. Every instruction takes 5 cycles from birth to death -- fetch, decode, execute, memory, writeback. So that's how long it takes to process a single instruction.

What you want to know is how long it takes for one instruction to hand off its result to a dependent instruction. Say you have two consecutive ADD instructions, and there's a dependency through R1:

ADD R1, R2, R3
ADD R4, R1, R5

If there were no forwarding paths, we'd have to stall the second instruction for multiple cycles (2 or 3 depending on how writeback works), so that the first one could store its result into the register file before the second one reads that as input in the decode stage.

However, there are forwarding paths that allow valid results (but ones that are not yet written back) to be picked out of the pipeline. So let's say the first ADD gets all its inputs from the register file in decode. The second one will get R5 out of the register file, but it'll get R1 out of the pipeline register following the execute stage. In other words, we're routing the output of the ALU back into its input one cycle later.

Out-of-order processors make ubiquitous use of forwarding. They will have lots of different functional units that have lots of different latencies. For instance, ADD and AND will typically take one cycle (TO DO THE MATH, putting aside all of the pipeline stages before and after), MUL will take like 4, floating point operations will take lots of cycles, memory access has variable latency (due to cache misses), etc.

By using forwarding, we can limit the critical path of an instruction to just the latencies of the execution units, while everything else (fetch, decode, retirement), it out of the critical path. Instructions get decoded and dumped into instruction queues, awaiting their inputs to be produced by other executing instructions. When an instruction's dependency is satisfied, then it can begin executing.

Let's consider this example

MUL R1,R5,R6
ADD R2,R1,R3
AND R7,R2,R8

I'm going to make an attempt at drawing a timeline that shows the flow of these instructions through the pipeline.

MUL  FDIXXXXWR
ADD   FDIIIIXWR
AND    FDIIIIXWR

Key:

F - Fetch
D - Decode
I - Instruction queue (IQ)
X - execute
W - writeback/forward/bypass
R - retire

So, as you see, the multiply instruction has a total lifetime of 9 cycles. But there is overlap in execution of the MUL and the ADD, because the processor is pipelined. When the ADD enters the IQ, it has to wait for its input (R1), and likewise so does the AND that is dependent on the ADD's result (R2). What we care about is not how long the MUL lives in total but how long any dependent instruction has to wait. That is its EFFECTIVE latency, which is 4 cycles. As you can see, once the ADD executes, the dependent AND can execute on the next cycle, again due to forwarding.

like image 196
Timothy Miller Avatar answered Nov 02 '25 17:11

Timothy Miller


I'm missing something fundamental re. CPU pipelines: at a basic level, why do instructions take differing numbers of clock cycles to complete and how come some instructions only take 1 cycle in a multi-stage CPU?

Because what we're interested in is in speed between instructions, not the start to end time of a single instruction.

Besides the obvious of "different instructions require a different amount of work to complete", hear me out...

Well that's the key answer to why different instructions have different latencies.

Consider an i7 with an approx 14 stage pipeline. That takes 14 clock cycles to complete a run-through. AFAIK, that should mean the entire pipeline has a latency of 14 clocks. Yet this isn't the case.

That is correct, though that's not a particularly meaningful number. For example, why do we care how long it takes before the CPU is entirely done with an instruction? That has basically no effect.

An XOR completes in 1 cycle and has a latency of 1 cycle, indicating it doesn't go through all 14 stages. BSR has a latency of 3 cycles, but a throughput of 1 per cycle. AAM has a latency of 20 cycles (more that the stage count) and a throughput of 8 (on an Ivy Bridge).

This is just a bunch of misunderstandings. An XOR introduces one cycle of latency into a dependency chain. That is, if I do 12 instructions that each modify the previous instruction's value and then add an XOR as the 13th instruction, it will take one cycle more. That's what the latency means.

Some instructions cannot be issued every clock, yet take less than 14 clocks to complete.

Right. So?

I know about the multiple execution units. I don't understand how the length of instructions in terms of latency and throughput relate to the number of pipline stages.

They don't. Why should there be any connection? Say there's 14 extra stages at the beginning of the pipeline. Why would that effect latency or throughput at all? It would just mean everything happens 14 clock cycles later, but still at the same rate. (Though likely it would impact the cost of a mispredicted branch and other things.)

like image 22
David Schwartz Avatar answered Nov 02 '25 18:11

David Schwartz



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!