 

Why is the branch delay slot deprecated or obsolete?

While reading the RISC-V User-Level ISA manual, I noticed that it says "OpenRISC has condition codes and branch delay slots, which complicate higher performance implementations", so RISC-V has no branch delay slot (RISC-V User-Level ISA manual link). Moreover, Wikipedia says that most newer RISC designs omit the branch delay slot. Why have most newer RISC architectures gradually dropped the branch delay slot?

tommycc asked Feb 16 '19 15:02


2 Answers

Citing Hennessy and Patterson (Computer architecture and design, 5th ed.)

Fallacy: You can design a flawless architecture.
All architecture design involves trade-offs made in the context of a set of hardware and software technologies. Over time those technologies are likely to change, and decisions that may have been correct at the time they were made look like mistakes. (...) An example in the RISC camp is delayed branch. It was a simple matter to control pipeline hazards with five-stage pipelines, but a challenge for processors with longer pipelines that issue multiple instructions per clock cycle.

Indeed, in terms of software, delayed branches only have drawbacks: they make programs harder to read, and less efficient because the slot is frequently filled with nops.

In terms of hardware, it was a technological decision that made sense in the eighties, when pipelines had 5 or 6 stages and there was no other way to avoid the one-cycle branch penalty.

But today, pipelines are much more complex. The branch penalty is 15-25 cycles on recent Pentium μarchitectures. A one-instruction delayed branch is thus useless, and trying to hide the penalty with a 15-instruction delayed branch would be nonsensical and clearly impossible (that would break instruction set compatibility).

And we have developed new technologies. Branch prediction is a very mature technology. With present branch predictors, the misprediction rate is far lower than the fraction of branches with a useless (nop) delay slot, so prediction is more efficient, even on a processor with a 6-stage pipeline (like nios-f).
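To make the comparison above concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch: the function names and every number in it (30% of delay slots filled with nops, a 5% misprediction rate, 3-cycle and 20-cycle penalties) are illustrative assumptions, not measurements from any real CPU.

```python
def cycles_lost_delay_slot(unfilled_fraction):
    """Average cycles wasted per branch when the single delay slot holds a nop."""
    return unfilled_fraction * 1.0  # one wasted issue slot per nop-filled delay slot


def cycles_lost_predictor(mispredict_rate, penalty_cycles):
    """Average cycles wasted per branch with a predictor and no delay slot."""
    return mispredict_rate * penalty_cycles


def break_even_mispredict_rate(unfilled_fraction, penalty_cycles):
    """Misprediction rate at which both schemes waste the same number of cycles."""
    return unfilled_fraction / penalty_cycles


if __name__ == "__main__":
    # All figures below are assumed for illustration only.
    unfilled = 0.30  # assumed share of delay slots the compiler fills with a nop
    print("delay slot, short pipeline:", cycles_lost_delay_slot(unfilled))
    print("predictor, short pipeline: ", cycles_lost_predictor(0.05, 3))
    print("predictor, deep pipeline:  ", cycles_lost_predictor(0.05, 20))
    print("break-even rate, 3 cycles: ", break_even_mispredict_rate(unfilled, 3))
```

Under these assumed numbers the predictor wastes fewer cycles per branch on the short pipeline. On the deep pipeline a single delay slot could hide at most one of the 20 penalty cycles anyway, so prediction is needed there regardless.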

So delayed branches are less efficient in hardware and software. No reason to keep them.

Alain Merigot answered Sep 18 '22 23:09


Delay slots are only helpful on a short in-order scalar pipeline, not on a high-performance superscalar one, and especially not on one with out-of-order execution.

They complicate exception handling significantly (for HW and software), because you need to record current program-counter and separately a next-PC address in case the instruction in the delay slot takes an exception.
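A toy interpreter can illustrate why two addresses are needed. This is a minimal sketch of a made-up accumulator machine with one branch delay slot, not MIPS or any real ISA; the instruction names and the `run` function are hypothetical.

```python
def run(program, max_steps=100):
    """Run a toy program on a machine with one branch delay slot.

    Instructions are tuples: ("add", n), ("jump", target), ("trap",), ("halt",).
    Returns (acc, trace, saved) where saved is the (pc, npc) pair captured
    at a trap, or None if the program halted normally.
    """
    pc, npc = 0, 1  # current instruction and the one that follows it
    acc = 0
    trace = []      # addresses of executed instructions, in order
    for _ in range(max_steps):
        op = program[pc]
        trace.append(pc)
        next_npc = npc + 1  # default: keep running sequentially after npc
        if op[0] == "halt":
            break
        if op[0] == "trap":
            # Both addresses must be saved: npc may not be pc + 1.
            return acc, trace, (pc, npc)
        if op[0] == "add":
            acc += op[1]
        elif op[0] == "jump":
            # The jump takes effect one instruction late: the instruction
            # at the old npc (the delay slot) still executes first.
            next_npc = op[1]
        pc, npc = npc, next_npc
    return acc, trace, None


if __name__ == "__main__":
    body = [("add", 100), ("add", 100), ("add", 10), ("halt",)]
    print(run([("jump", 4), ("add", 1)] + body))  # delay slot at 1 executes
    print(run([("jump", 4), ("trap",)] + body))   # trap inside the delay slot
```

In the second program the trap sits in the delay slot, so the saved state is `(1, 4)`. Resuming from address 1 alone would fall through to address 2 and lose the pending jump to 4, which is why the handler needs both values (or some equivalent record that the faulting instruction was in a delay slot).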

They also complicate the question of how many instructions need to be killed on a mispredict (see "How many instructions need to be killed on a miss-predict in a 6-stage scalar or superscalar MIPS?") by introducing multiple possibilities: the branch-delay instruction may already be in the pipeline and must not be killed, or it may still be waiting on an I-cache miss, so re-steering the front-end has to wait until after the branch-delay instruction has been fetched.


Branch-delay slots architecturally expose an implementation detail of in-order classic RISC pipelines to the benefit of performance on that kind of uarch, but anything else has to work around it. It only avoids code-fetch bubbles from taken branches (even without branch prediction) if your uarch is a scalar classic RISC.

Even a modern in-order uarch needs branch prediction for good performance, with memory latency (measured in CPU clock cycles) being vastly higher than in the days of early MIPS.

(Fun fact: MIPS's 1 delay slot was sufficient to hide the total branch latency on R2000 MIPS I, thanks to clever design that kept that down to 1 cycle.)


Branch delay slots can't always be filled optimally by compilers, so even if we can implement them in a high-performance CPU without significant overhead, they do cost throughput in terms of total work done per instruction. Programs will usually need to execute more instructions, not fewer, with delay slots in the ISA.

(Although sometimes doing something unconditional after the compare-and-branch can allow reuse of the register instead of needing a new register, on an ISA without flags like MIPS where branch instructions test integer registers directly.)

Peter Cordes answered Sep 18 '22 23:09