I understand the basic principle of instruction pipelining.
I also get that some instructions may take longer to execute (cycles per instruction).
But I don't get the link between the two.
All pipeline diagrams that I see seem to have "perfect" instructions: they all have the same length (number of cycles).
But what if the first instruction takes 5 cycles and the second one takes 3 cycles? Does the CPU stall for 2 cycles?
Would this stall be called a bubble? Or is this different from hazards and data dependencies?
Also, does the length of an instruction, in bytes, matter in any way?
Pipelining, also known as pipeline processing, is a technique in which the execution of multiple instructions is overlapped: the processor is split into stages, and while one instruction occupies one stage, the instructions behind it can already occupy the earlier stages, so instructions flow through the processor in an orderly stream.
Cycles per instruction (CPI) is a ratio of two values: the number of CPU cycles used divided by the number of instructions executed.
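A quick worked example (the numbers are invented purely for illustration): if a program retires 1,000,000 instructions and takes 1,500,000 CPU cycles to do so, then

CPI = 1,500,000 cycles / 1,000,000 instructions = 1.5 cycles per instruction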
Pipelining reduces the clock cycle time (CT) to the delay of the longest stage plus the pipeline-register delay, instead of the sum of all stage delays. The latency of a single instruction becomes CT * N, where N is the number of stages, since the instruction has to pass through every stage and each stage takes one cycle. The throughput formula itself stays the same (instructions completed per unit of time), but because one instruction can complete every cycle once the pipeline is full, and that cycle is now much shorter, throughput improves.
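A small numeric sketch, with stage delays invented purely for illustration: say there are 5 stages of 200 ps each and the pipeline registers add 20 ps.

Unpipelined: cycle time = 5 * 200 ps = 1000 ps, latency = 1000 ps per instruction
Pipelined:   CT = 200 ps (longest stage) + 20 ps (register delay) = 220 ps
             latency = CT * N = 220 ps * 5 = 1100 ps (slightly worse per instruction)
             throughput ~ 1 instruction per 220 ps once the pipeline is full, vs. 1 per 1000 ps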
Cycles per instruction, or CPI, is also a metric that has been part of the VTune interface for many years. It gives the average number of CPU cycles required to retire an instruction, and is therefore an indicator of how much latency in the system affected the running application.
You touched on quite a few things in your question, so I'll put in my 2 cents to try and make it all a bit clearer. Let's look at an in-order MIPS architecture as an example - it features all of the things you mention except the variable-length instructions.
Many MIPS CPUs have a 5-stage pipeline with the stages IF -> ID -> EX -> MEM -> WB (https://en.wikipedia.org/wiki/Classic_RISC_pipeline).
Let's first look at instructions for which each of these stages generally takes a single clock cycle (this might not be the case on a cache miss, for example): for instance, SW (store word to memory), BNEZ (branch on not zero) and ADD (add two registers and store the result in a register). Not all of these instructions have useful work to do in every pipe stage. For example, SW has no work to do in the WB stage, BNEZ can be finished as early as the ID stage (that's the earliest the target address can be computed), and ADD has no work in the MEM stage.
Regardless of that, each of these instructions will go through each and every stage of the pipeline, even if it has no work to do in some of them. The instruction will occupy the stage, but no actual work will be done (e.g. no result is written to a register in the WB stage for the SW instruction). In other words, there are no stalls in this case.
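In a pipeline diagram, that ideal case looks like this: one new instruction enters every cycle, and each instruction occupies every stage for exactly one cycle, whether or not it does useful work there.

cycle:   1    2    3    4    5    6    7
SW       IF   ID   EX   MEM  WB
BNEZ          IF   ID   EX   MEM  WB
ADD                IF   ID   EX   MEM  WB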
Let's move on to more complex instructions, such as MUL or DIV, whose EX stage can take tens of cycles. Things get much trickier here: instructions can now complete out of order even though they are always fetched in order (meaning WAW hazards are now possible). Take the following example:
MUL R1, R10, R11
ADD R2, R5, R6
MUL is fetched first and reaches the EX stage before ADD, but ADD will complete much earlier because MUL's EX stage runs for more than 10 clock cycles. Still, the pipeline won't be stalled at any point, as there is no possibility of a hazard in this sequence: neither a RAW nor a WAW hazard can occur.
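A rough pipeline diagram of that, assuming MUL executes in a separate multi-cycle multiply unit (so the single-cycle integer ALU stays free for ADD) and using 10 EX cycles for MUL purely as an illustration:

cycle:             1    2    3    4    5    6    ...  12    13    14
MUL R1, R10, R11   IF   ID   EX1  EX2  EX3  EX4  ...  EX10  MEM   WB
ADD R2, R5, R6          IF   ID   EX   MEM  WB

ADD writes back in cycle 6 while MUL only writes back in cycle 14, yet nothing had to stall. Now take another example: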
MUL R1, R10, R11
ADD R1, R5, R6
Now both MUL and ADD write the same register. As ADD will complete much earlier than MUL, it will reach WB first and write its result. At a later point, MUL will do the same and R1 would end up holding the wrong (old) value. This is where a pipeline stall is needed. One way to solve this is to prevent ADD from issuing (moving from the ID to the EX stage) until MUL enters the MEM stage. That's done by freezing, or stalling, the pipeline; the resulting empty pipeline slots are exactly the bubbles you mention. Introducing floating-point operations leads to similar problems in the pipeline.
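Going back to the MUL/ADD pair above (same illustrative 10-cycle multiply), the stall would look roughly like this, with ADD held in the ID stage until MUL moves on to its MEM stage:

cycle:             1    2    3    4    ...  12    13    14    15
MUL R1, R10, R11   IF   ID   EX1  EX2  ...  EX10  MEM   WB
ADD R1, R5, R6          IF   ID   ID   ...  ID    EX    MEM   WB

This way ADD writes R1 after MUL does, so the register ends up with the value program order dictates.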
I'll complete my answer by touching on the topic of fixed-length vs. variable-length instruction formats (even though you didn't explicitly ask about it). MIPS (and most RISC) CPUs use a fixed-length encoding. This tremendously simplifies the implementation of a CPU pipeline, as instructions can be decoded and their input registers read within a single cycle (assuming that the register fields sit at fixed positions in a given instruction format, which is true for MIPS). Additionally, fetching is simplified, as instructions are always the same length, so there's no need to start decoding an instruction just to find out how long it is.
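For reference, every MIPS instruction is exactly 32 bits wide, and the register fields are at fixed bit positions. The R-type format used by ADD, for instance, looks like this:

| opcode | rs     | rt     | rd     | shamt  | funct  |
| 6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits |

Because rs and rt are always in the same place, the register file can be read in the ID stage in parallel with the rest of the decode.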
There are of course disadvantages: it is harder to generate compact binaries, which leads to larger programs, which in turn leads to poorer cache performance. Additionally, memory traffic increases, as more bytes have to be read from (and written to) memory, which can matter on energy-efficient platforms.
This code-density advantage of compact encodings has led some RISC architectures to define a 16-bit instruction-length mode (MIPS16 or ARM Thumb), or even a variable-length instruction set (ARM Thumb2 has 16-bit and 32-bit instructions). Unlike x86, Thumb2 was designed to make it easy to determine instruction length quickly, so it's still easy for CPUs to decode.
These compacted ISAs often require more instructions to implement the same program, but take less total space and run faster if code-fetch is more of a bottleneck than instruction throughput in the pipeline. (Small / nonexistent instruction cache, and/or reading from a ROM in an embedded CPU).