Link between instruction pipelining and cycles per instruction

1 Answer

You touched on quite a few things in your question, so I'll put in my 2 cents to try and make it all a bit clearer. Let's look at an in-order MIPS architecture as an example - it features all of the things you mention except the variable-length instructions.

Many MIPS CPUs have 5-stage pipelines with the stages IF -> ID -> EX -> MEM -> WB (see https://en.wikipedia.org/wiki/Classic_RISC_pipeline). Let's first look at instructions where each of these stages generally takes a single clock cycle (this might not be the case on a cache miss, for example): SW (store word to memory), BNEZ (branch on not zero) and ADD (add two registers and store the result to a register). Not all of these instructions have useful work to do in every pipe stage. For example, SW has no work to do in the WB stage, BNEZ can be finished as early as the ID stage (that's the earliest the target address can be computed) and ADD has no work in the MEM stage.

Regardless of that, each of these instructions goes through each and every stage of the pipeline, even the stages where it has no work to do. The instruction occupies the stage but nothing happens there (e.g. no result is written to a register in the WB stage for SW). Because every instruction spends exactly one cycle in every stage, no stalls are needed in this case.
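To tie this back to cycles per instruction, here is a minimal sketch of the stall-free case. It is only a toy model, not how any real MIPS core is implemented: the helper names `pipeline_diagram` and `ideal_cpi` are made up for illustration, and it assumes one instruction is fetched per cycle and nothing ever stalls.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(instructions):
    """Return {name: {cycle: stage}} for an ideal, stall-free in-order pipeline."""
    diagram = {}
    for i, name in enumerate(instructions):
        # Instruction i enters IF in cycle i + 1 and advances one stage per cycle.
        diagram[name] = {i + 1 + s: stage for s, stage in enumerate(STAGES)}
    return diagram

def ideal_cpi(n_instructions, n_stages=len(STAGES)):
    """Cycles per instruction with no stalls: (pipeline fill + n) / n."""
    total_cycles = n_stages + n_instructions - 1
    return total_cycles / n_instructions

program = ["SW", "BNEZ", "ADD"]  # names double as keys, so keep them unique
diagram = pipeline_diagram(program)
last_cycle = max(max(cycles) for cycles in diagram.values())

# Print one row per instruction, one column per clock cycle.
for name in program:
    row = [diagram[name].get(c, "--") for c in range(1, last_cycle + 1)]
    print(f"{name:5}", " ".join(f"{cell:3}" for cell in row))

print(ideal_cpi(3), ideal_cpi(1000))
```

Each instruction still takes 5 cycles from IF to WB, but because a new one starts every cycle, the total is `n_stages + n - 1` cycles for `n` instructions, so the CPI approaches 1 as the program gets longer.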

Things get much trickier with more complex instructions such as MUL or DIV, whose EX stage can take up to tens of cycles. Now instructions can complete out of order even though they are always fetched in order (meaning WAW hazards are now possible). Take the following example:

MUL R1, R10, R11
ADD R2, R5, R6

MUL is fetched first and reaches the EX stage before ADD, but ADD completes well before MUL, because MUL's EX stage runs for more than 10 clock cycles. Even so, the pipeline is never stalled, as this sequence has no hazards: the two instructions write different registers (no WAW) and ADD doesn't read MUL's result (no RAW). Take another example:
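A rough way to see the out-of-order completion is to compute the cycle in which each instruction reaches WB. This is a sketch under stated assumptions: the function name `completion_cycles` is hypothetical, the 10-cycle MUL latency is an illustrative number, and the model assumes separate functional units (so a long MUL does not block the ALU) and no hazards.

```python
def completion_cycles(program, ex_latency):
    """Return {(op, dest): cycle in which the instruction is in WB}.

    Assumes one fetch per cycle, no hazards, and that a long-running EX
    in one functional unit does not block other instructions.
    """
    done = {}
    for i, (op, dest) in enumerate(program):
        fetch_cycle = i + 1
        # IF (the fetch cycle itself) and ID take one cycle each,
        # EX takes ex_latency[op] cycles, then MEM and WB take one cycle each.
        done[(op, dest)] = fetch_cycle + 1 + ex_latency[op] + 2
    return done

# Latencies are assumptions for illustration only.
latency = {"MUL": 10, "ADD": 1}
program = [("MUL", "R1"), ("ADD", "R2")]
done = completion_cycles(program, latency)

for (op, dest), cycle in done.items():
    print(f"{op} -> {dest} writes back in cycle {cycle}")
```

With these numbers ADD writes back in cycle 6 and MUL in cycle 14: fetched in order, completed out of order, and nothing had to wait.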

MUL R1, R10, R11
ADD R1, R5, R6

Now both MUL and ADD write the same register. As ADD completes much earlier than MUL, it goes through WB and writes its result first. At a later point MUL does the same, and R1 ends up holding the wrong value: MUL's result, even though ADD is the later instruction in program order. This is where a pipeline stall is needed. One way to solve this is to prevent ADD from issuing (moving from the ID to the EX stage) until MUL enters the MEM stage. That's done by freezing or stalling the pipeline. Introducing floating-point operations leads to similar problems in the pipeline.
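The stall rule described above (hold the later writer in ID until the earlier writer enters MEM) can be sketched like this. It is a simplified model, not a real interlock implementation: `schedule` is a made-up name, the 10-cycle MUL latency is an assumption, and only WAW hazards and in-order issue are modelled (no RAW or structural hazards).

```python
def schedule(program, ex_latency):
    """Return one dict per instruction with its EX start, MEM, WB and stall cycles."""
    rows = []
    for i, (op, dest) in enumerate(program):
        no_stall_ex = i + 3  # IF in cycle i+1, ID in i+2, EX in i+3
        ex_start = no_stall_ex
        if rows:
            # In-order issue: never enter EX before the previous instruction did.
            ex_start = max(ex_start, rows[-1]["ex_start"] + 1)
        for prev in rows:
            if prev["dest"] == dest:
                # WAW hazard: wait in ID until the earlier writer enters MEM.
                ex_start = max(ex_start, prev["mem"])
        mem = ex_start + ex_latency[op]
        rows.append({"op": op, "dest": dest, "ex_start": ex_start,
                     "mem": mem, "wb": mem + 1,
                     "stall": ex_start - no_stall_ex})
    return rows

latency = {"MUL": 10, "ADD": 1}  # assumed latencies, for illustration
hazard = schedule([("MUL", "R1"), ("ADD", "R1")], latency)
no_hazard = schedule([("MUL", "R1"), ("ADD", "R2")], latency)

for row in hazard:
    print(row)
```

In the hazard case ADD is held for 9 cycles and writes R1 in cycle 15, one cycle after MUL, so the final value of R1 is ADD's. In the hazard-free case the same code reports zero stall cycles.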

I'll complete my answer by touching on the topic of fixed-length vs. variable-length instruction formats (even though you didn't explicitly ask about it). MIPS (and most RISC) CPUs have a fixed-length encoding. This tremendously simplifies the implementation of a CPU pipeline, as instructions can be decoded and input registers read within a single cycle (assuming that the register fields sit at fixed positions in a given instruction format, which is true for MIPS). Fetching is simplified too: instructions are always the same length, so there's no need to start decoding an instruction just to find out where the next one begins.
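As a small illustration of why fixed positions help, here is a sketch that pulls the fields out of a 32-bit MIPS R-type word with plain shifts and masks. The field layout (opcode, rs, rt, rd, shamt, funct) is the standard MIPS32 R-type format; the function names and the base address are hypothetical.

```python
def fetch_address(base, index, width=4):
    """Address of the index-th instruction: no decoding needed to find it."""
    return base + width * index

def decode_r_type(word):
    """Split a 32-bit MIPS R-type word into its fixed-position fields."""
    return {
        "opcode": (word >> 26) & 0x3F,  # bits 31..26
        "rs":     (word >> 21) & 0x1F,  # bits 25..21, first source register
        "rt":     (word >> 16) & 0x1F,  # bits 20..16, second source register
        "rd":     (word >> 11) & 0x1F,  # bits 15..11, destination register
        "shamt":  (word >> 6) & 0x1F,   # bits 10..6
        "funct":  word & 0x3F,          # bits 5..0
    }

# Encoding of "add $2, $5, $6" (opcode 0, funct 0x20).
fields = decode_r_type(0x00A61020)
print(fields)
print(hex(fetch_address(0x00400000, 3)))  # base address chosen arbitrarily
```

The source register numbers are always in the same bits, so the register file read can start in parallel with the rest of decoding, and the address of the next instruction is just the current one plus 4.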

There are of course disadvantages: fixed-length encodings make it harder to generate compact binaries, which leads to larger programs, which in turn leads to poorer instruction-cache performance. Memory traffic also increases, since more instruction bytes have to be fetched from memory, which can matter on energy-constrained platforms.

This disadvantage has led some RISC architectures to define a 16-bit instruction-length mode (MIPS16 or ARM Thumb), or even a variable-length instruction set (ARM Thumb-2 has 16-bit and 32-bit instructions). Unlike x86, Thumb-2 was designed to make it easy to determine the instruction length quickly, so it's still easy for CPUs to decode.
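To show how cheap the length check is in Thumb-2: the top five bits of the first halfword are enough. If they are 0b11101, 0b11110 or 0b11111, the instruction is 32 bits long, otherwise it is 16 bits. A minimal sketch (the function names are mine, and the input is assumed to be a list of already-fetched 16-bit halfwords):

```python
def thumb_instruction_length(first_halfword):
    """Return the instruction length in bytes, given its first 16-bit halfword."""
    top5 = (first_halfword >> 11) & 0x1F
    return 4 if top5 in (0b11101, 0b11110, 0b11111) else 2

def split_instructions(halfwords):
    """Group a stream of halfwords into per-instruction tuples."""
    out = []
    i = 0
    while i < len(halfwords):
        n = thumb_instruction_length(halfwords[i]) // 2  # halfwords consumed
        out.append(tuple(halfwords[i:i + n]))
        i += n
    return out

# 0x4770 is BX LR (16-bit), 0xF000 0xF800 is a 32-bit BL pair,
# 0xE7FE is a 16-bit unconditional branch to itself.
stream = [0x4770, 0xF000, 0xF800, 0xE7FE]
groups = split_instructions(stream)
print([tuple(hex(h) for h in g) for g in groups])
```

Compare that with x86, where prefixes, opcode bytes, ModRM/SIB and immediates all influence the length, so finding the boundary involves much more of the decode work.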

These compacted ISAs often require more instructions to implement the same program, but the code takes less total space and runs faster if code fetch is more of a bottleneck than instruction throughput in the pipeline (for example with a small or nonexistent instruction cache, and/or when reading from a ROM in an embedded CPU).
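A back-of-the-envelope model of that trade-off, with entirely made-up numbers: assume the 16-bit encoding needs 30% more instructions, and that run time is bounded by whichever is slower, fetching the code bytes or executing the instructions. The function `run_cycles` and all parameter values below are assumptions for illustration, not measurements.

```python
def run_cycles(n_instructions, bytes_per_instruction, fetch_bytes_per_cycle, cpi=1.0):
    """Toy model: run time is limited by the slower of fetch and execute."""
    fetch_cycles = n_instructions * bytes_per_instruction / fetch_bytes_per_cycle
    execute_cycles = n_instructions * cpi
    return max(fetch_cycles, execute_cycles)

# Hypothetical program: 100 fixed 32-bit instructions vs. 130 16-bit ones.
slow_fetch = 1  # bytes per cycle, e.g. a narrow ROM with no cache (assumed)
fast_fetch = 8  # bytes per cycle, e.g. hits in an instruction cache (assumed)

results = {
    ("32-bit", "slow"): run_cycles(100, 4, slow_fetch),
    ("16-bit", "slow"): run_cycles(130, 2, slow_fetch),
    ("32-bit", "fast"): run_cycles(100, 4, fast_fetch),
    ("16-bit", "fast"): run_cycles(130, 2, fast_fetch),
}
for key, cycles in results.items():
    print(key, cycles)
```

In this model the compact encoding wins when fetch is the bottleneck (260 vs. 400 cycles) and loses when the pipeline is (130 vs. 100 cycles), which is the trade-off described above.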

answered Nov 05 '22 03:11

dbajgoric

Tags:

cpu-architecture

assembly

executable

cpu

Bilow
