Due to the huge impact on performance, I never wonder if my current day desktop CPU has branch prediction. Of course it does. But how about the various ARM offerings? Does iPhone or android phones have branch prediction? The older Nintendo DS? How about PowerPC based Wii? PS 3?
Whether they have a complex prediction unit is not so important, but if they have at least some dynamic prediction, and whether they do some execution of instructions following an expected branch.
What is the cutoff for CPUs with branch prediction? A hand held calculator from decades ago obviously doesn't have one, while my desktop does. But can anyone more clearly outline where one can expect dynamic branch prediction?
If it is unclear, I am talking about the kind of prediction where the condition is changing, varying the expected path during runtime.
On the SPEC'89 benchmarks, very large bimodal predictors saturate at 93.5% correct, once every branch maps to a unique counter. The predictor table is indexed with the instruction address bits, so that the processor can fetch a prediction for every instruction before the instruction is decoded.
If you don't already use tricks like MIPS I early eval of branch conditions, your branch latency would be 2 cycles (IF to EX) for conditional branches. Static always-taken prediction would shorten that to 1 cycle (IF to ID).
Current NVIDIA GPUs do not support branch prediction.
The branch predictor operates in the Fetch stage of the pipeline so that it can determine which instruction to execute on the next cycle. When it predicts that the branch should be taken, the processor fetches the next instruction from the branch destination stored in the branch target buffer.
Any CPU with a pipeline beyond a few stages requires at least some primitive branch prediction, otherwise it can stall waiting on computation results in order to decide which way to go. The Intel Atom is an in-order core, but with a fairly deep pipeline, and it therefore requires a pretty decent branch predictor.
Old ARM 7 designs were only three stages. Combine that with things like branch delay slots (required on MIPS, optional on SPARC), and branch prediction isn't so useful.
Incidentally, when MIPS decided to get more performance by going beyond 4 pipeline stages, the branch delay slot became an annoyance. In the original design, it was necessary, because there was no branch predictor. Therefore, you had to sequence your branch instruction prior to the last instruction to be executed before the branch. With the longer pipeline, they needed a branch predictor, obviating the need for a branch delay slot, but they had to emulate it anyway in order to run older code.
The problem with a branch delay slot is that it can only be filled with a useful instruction about 50% of the time. The rest of the time, you either fill it with an instruction whose result is likely to be thrown away, or you use a NO-OP.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With