Does a branch misprediction flush the entire pipeline, even for a very short if-statement body?

Everything I've read seems to indicate that a branch misprediction always results in the entire pipeline being flushed, which means a lot of wasted cycles. I never hear anyone mention any exceptions for short if-conditions.

This seems like it would be really wasteful in some cases. For example, suppose you have a lone if-statement with a very simple body that is compiled down to 1 CPU instruction. The if-clause would be compiled into a conditional jump forward by one instruction. If the CPU predicts the branch to not be taken, then it will begin executing the if-body instruction, and can immediately begin executing the following instructions. Now, once evaluation of the if-condition has reached the end of the pipeline, which could be, say, 12 cycles later, the CPU now knows whether its prediction was right or wrong. If it mispredicted, and the branch was actually taken, then the CPU really only has to discard 1 instruction from the pipeline (the one in the if-body). However, if it flushes the entire pipeline, then all the work that was done on the following instructions was wasted as well, and will have to be repeated for no reason. That's a lot of wasted cycles on a deeply pipelined architecture.

So do modern CPUs have any mechanism to discard only the few instructions that are inside of a short if-body? Or does it really flush the entire pipeline? If it's the latter, then I suppose using a conditional move instruction would get better performance. As an aside, does anyone know if modern compilers are good at converting short if-statements into cmov instructions?

810

asked Apr 08 '15 18:04

Norg74

2 Answers

Most general purpose processors do flush the pipeline on a branch misprediction. The negative performance impact of conditional branches has motivated proposals for eager execution (where both paths are executed and the correct path selected later) and dynamic predication (where instructions in the branch shadow are predicated) in addition to extensive research on branch prediction (as well as other techniques). (Mark Smotherman's page on eager execution provides some details and references. I would add Hyesoon Kim et al.'s "Wish Branches: Combining Conditional Branching and Predication for Adaptive Predicated Execution", 2005, as a significant paper.)

IBM's POWER7 seems to be the first mainstream processor to implement anything more sophisticated than prefetching an alternate path (i.e., eager fetch), and it only handles the single instruction case. (POWER7 uses a branch prediction confidence estimate to choose whether to predicate or use prediction.)

Eager execution has the obvious problem of exploding resource use. Even with selective eagerness based on branch prediction confidence, speculation depth, and resource availability (information available to the front-end), it can easily be more effective to speculate deeper down a single path. Discovering the joining points of multiple paths and avoiding excessive redundant computation can also add complexity. (Ideally, control independent operations would only be executed once and joining and data flow would be optimized, but such optimization adds complexity.)

For a deeply pipelined in-order processor, it may seem attractive to predict short forward branches as not taken and only flush backward in the pipeline to the instruction targeted by the taken branch when the branch is actually taken. If only one such branch is allowed in the pipeline at a time (other branches use prediction), adding a single bit to each instruction could control whether it is converted to a nop or executed. (If only the case of a single instruction being branched over is handled, allowing multiple branches in the pipeline might not be especially complex.)
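To make the "one extra bit per instruction" idea concrete, here is a toy C model of the two recovery strategies. It is only a sketch: the `slot` struct and both function names are invented for illustration, and a real pipeline tracks far more state than a valid bit.

```c
#include <stdbool.h>
#include <stddef.h>

/* One in-flight instruction in a toy in-order pipeline (hypothetical model). */
struct slot {
    bool valid;     /* slot holds a real instruction */
    bool in_shadow; /* the extra bit: fetched from the body of the short branch */
};

/* Conventional recovery: squash every instruction younger than the branch.
   Slots are ordered oldest first. Returns the number of squashed instructions. */
size_t flush_younger(struct slot *pipe, size_t n, size_t branch_idx)
{
    size_t squashed = 0;
    for (size_t i = branch_idx + 1; i < n; i++) {
        if (pipe[i].valid) {
            pipe[i].valid = false;
            squashed++;
        }
    }
    return squashed;
}

/* Recovery described above for a taken short branch: turn only the flagged
   instructions into nops and keep everything fetched after them. */
size_t annul_shadow(struct slot *pipe, size_t n)
{
    size_t squashed = 0;
    for (size_t i = 0; i < n; i++) {
        if (pipe[i].valid && pipe[i].in_shadow) {
            pipe[i].valid = false;
            squashed++;
        }
    }
    return squashed;
}
```

With six instructions in flight (one older instruction, the branch, one flagged body instruction, three following instructions), `flush_younger` discards four instructions while `annul_shadow` discards one.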

This would be similar to annul-if-taken branch delay slots. MIPS has "Branch Likely" instructions that annul the delay-slot instruction if the branch is not taken, and these are marked as obsolete in Revision 2.62. While some of the justification for such is presumably to separate implementation from interface and the desire to recover instruction encoding space, this decision also hints that the concept has some issues.

If this were done for all short forward branches, it would throw away instructions when the branch was correctly predicted as taken. (Note that this penalty could be less if taken branches always experience a delay in fetch redirection, which would be more likely with a multi-cycle instruction cache access in a deeply pipelined processor. In that case, fetching as if there was no branch could have the same performance as a correctly predicted taken branch. However, one could argue that the processor should special-case such short taken branches to minimize such fetch bubbles.)

As an example consider a scalar pipeline (non-branch instructions per cycle equal to 1.0) with branch resolution at the end of the eighth stage and no fetch redirection penalty on correctly predicted taken branches, handling single-instruction branch-overs. Assume 75% branch predictor accuracy (unbiased by direction) for such short forward branches (2% of instructions, taken 30% of the time) and 93% accuracy for other branches (18% of instructions). Eight cycles would be saved for short branches that would be mispredicted as taken (17.5% of such branches; 0.35% of instructions), seven cycles when mispredicted as not taken (7.5%; 0.15%), and one cycle would be lost when correctly predicted as taken (22.5%; 0.45%). In total 0.034 cycles per instruction would be saved. Without this optimization the cycles per instruction would be 1.1408.
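The arithmetic in this example can be reproduced with a few lines of C. This is only a sketch of the simple model described above (fixed misprediction penalty, 1.0 IPC otherwise); the `branch_mix` struct and the function names are made up for illustration.

```c
/* Inputs for the simple model; field names are hypothetical. */
struct branch_mix {
    double short_frac;     /* short forward branches as a fraction of instructions */
    double short_taken;    /* fraction of short forward branches that are taken */
    double short_accuracy; /* predictor accuracy on short forward branches */
    double other_frac;     /* all other branches as a fraction of instructions */
    double other_accuracy; /* predictor accuracy on the other branches */
    double penalty;        /* misprediction penalty in cycles */
};

/* Cycles per instruction when every branch uses ordinary prediction. */
double baseline_cpi(const struct branch_mix *m)
{
    double short_miss = m->short_frac * (1.0 - m->short_accuracy);
    double other_miss = m->other_frac * (1.0 - m->other_accuracy);
    return 1.0 + m->penalty * (short_miss + other_miss);
}

/* Cycles per instruction saved by always falling through short forward
   branches and annulling the single skipped instruction when taken. */
double cycles_saved(const struct branch_mix *m)
{
    double miss = 1.0 - m->short_accuracy;
    /* Predicted taken, actually not taken: the full penalty is avoided. */
    double mispredicted_taken = m->short_frac * (1.0 - m->short_taken) * miss;
    /* Predicted not taken, actually taken: one annulled slot instead of a flush. */
    double mispredicted_not_taken = m->short_frac * m->short_taken * miss;
    /* Predicted taken, actually taken: one annulled slot is now wasted. */
    double correct_taken = m->short_frac * m->short_taken * m->short_accuracy;

    return m->penalty * mispredicted_taken
         + (m->penalty - 1.0) * mispredicted_not_taken
         - 1.0 * correct_taken;
}
```

Filling the struct with the rates stated in the example (0.02, 0.30, 0.75, 0.18, 0.93, and a penalty of 8) gives a baseline of 1.1408 cycles per instruction and a saving of 0.034, up to floating-point rounding.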

(While the above numbers are just for example, they are probably not far from reality except for the 1.0 IPC for non-branch instructions. Providing a small loop cache would reduce the misprediction penalty (and save power in short loops) because instruction cache access would probably be three of the eight cycles. Adding the effect of cache misses would further reduce the percentage improvement from this branch optimization. Avoiding the overhead for predicted "strongly taken" short branches might be worthwhile.)

In-order designs tend to use narrower and shallower pipelines and prefer simplicity (for lower design, power, and area costs). Since the instruction set is likely to support branchless code for many short-branch cases, the incentive to optimize this aspect is further decreased.

For out-of-order implementations, the potentially branched over instructions would have to be predicated since the processor would want to be able to execute later non-dependent instructions. Predication introduces an additional data dependency which must be checked for scheduling. It is common for instruction schedulers to provide only two comparators per instruction and to split a conditional move into two operations (a conditional move is a simple instruction with only three data flow operands: the old value, the alternative value, and the condition; a predicated register-register add would have four operands). (There are alternative ways of addressing this issue, but this answer is already long.)

An out-of-order implementation would also not stall when a branch condition is not available. This is a tradeoff between a control dependency and a data dependency. With accurate branch prediction a control dependency is extremely inexpensive, but a data dependency can hold up forward progress waiting on data operands. (Of course, with a boolean data dependency, value prediction becomes somewhat more attractive. Using predicate prediction might be desirable in some cases and would have the advantage over simple predication of using dynamic cost and confidence estimates.)

(It is perhaps telling that ARM chose to drop extensive predication in 64-bit AArch64. While a large part of this is for instruction encoding, the benefit of predication for high-performance implementations is presumably relatively low.)

Compiler issues

The performance of branchless versus branching code depends on the predictability of the branch and other factors (including, if taken, any penalty for redirecting fetch), but it is difficult for the compiler to determine the predictability of a branch. Even profile data typically only provides branch frequencies which can give a pessimistic view of predictability since such does not account for the branch predictor using local or global history. A compiler is also not perfectly aware of timing of data availability and other dynamic aspects. If the condition is available later than the operands used for computation, then replacing a control dependence (branch prediction) with a data dependence (predication) could degrade performance. Branchless code may also introduce more live values, potentially adding register spill and fill overhead.
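As a rough illustration of the two forms being compared, here is a short if-body written with a branch and without one. The function names are made up, the mask trick assumes two's complement integers, and nothing here forces a particular instruction: a compiler is free to turn either function into a conditional jump or a conditional move.

```c
/* Branching form: a control dependency on the comparison. */
int clamp_branch(int x, int limit)
{
    if (x > limit)
        x = limit;
    return x;
}

/* Branchless form: the comparison becomes a data dependency of the result. */
int clamp_mask(int x, int limit)
{
    int mask = -(x > limit); /* all ones if x > limit, otherwise zero */
    return (limit & mask) | (x & ~mask);
}
```

In the second form the result cannot be produced until `x`, `limit`, and the comparison are all available, which is the data-dependence cost described above.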

Complicating this further, most instruction sets that only provide conditional move or select instructions do not provide a conditional store. While this can be worked around by using conditional move to select a safe, ignored storage location, such seems an unattractive complication. In addition, conditional move instructions are often more expensive than simple arithmetic instructions; an addition and conditional move might take three cycles where a correctly predicted branch and addition would take zero (if addition is branched over) or one cycle.
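The conditional-store workaround might look like this in C. It is a sketch under assumptions: the function names and the `scratch` variable are hypothetical, and a compiler may still choose to emit a branch for the second form.

```c
/* Branching form: the store only happens when cond is nonzero. */
void store_if_branch(int *dst, int value, int cond)
{
    if (cond)
        *dst = value;
}

/* Branchless form: select the address, then store unconditionally.
   scratch is a safe location whose contents are ignored. */
void store_if_select(int *dst, int value, int cond)
{
    int scratch;
    int *target = cond ? dst : &scratch;
    *target = value;
}
```

The second form always performs a store, so it trades the branch for an extra live value (the selected address) and a write that is sometimes thrown away.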

A further complication is that predicated operations are generally ignored by the branch predictor. If a later retained branch correlates with the condition of the removed branch, the branch misprediction rate may increase for that later branch. (Predicate prediction could be used to retain the predictor effects of such removed branches.)

With the increased emphasis on vectorization, the use of branchless code becomes even more significant since branch-based code constrains the ability to use operations on an entire vector.
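A small example of the kind of loop this affects, written both ways. The function names are invented, and whether a given compiler actually vectorizes either version depends on the target and optimization settings.

```c
#include <stddef.h>

/* Per-element branch: each iteration either adds the element or skips it. */
long sum_above_branch(const int *a, size_t n, int threshold)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] > threshold)
            sum += a[i];
    }
    return sum;
}

/* Select form: every iteration does the same work, which maps naturally
   onto a vector compare followed by a masked or blended add. */
long sum_above_select(const int *a, size_t n, int threshold)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (a[i] > threshold) ? a[i] : 0;
    return sum;
}
```

Both functions return the same value; the difference is only in how directly the loop body corresponds to operations on a whole vector.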

116

answered Nov 05 '22 22:11

Paul A. Clayton

Modern high-performance out-of-order CPUs usually do not flush the entire pipeline⁰ on a misprediction, but it doesn't really depend on the distance of the branch or work as you suggest.

They generally use something similar to the strategy of flushing the branch instruction and all younger instructions. The front-end is flushed, since it will be full of instructions on the mispredicted path, but beyond the front-end modern cores may have more than 100 instructions in-flight at once, only some of which may be younger than the branch.

This means that the cost of the branch is at least partly related to the surrounding instructions: if the branch condition can be checked early the impact of a misprediction can be limited or even zero¹. On the other hand, if the branch condition is handled late, after considerable resources have been spent on the wrong path, the cost can be large (e.g., larger than the 12-20 cycle "published" branch misprediction penalty you'll often see).

⁰ The exact terminology is up for debate here: the meaning of flushing the pipeline isn't entirely clear for out-of-order processors. Here I mean that the CPU does not flush all in-flight-but-possibly-not-executed instructions.

¹ In particular, the limiting factor for some sequence of instructions could be a dependency chain whose current execution is far enough behind the leading edge of the instruction window that the misprediction doesn't flush any of those instructions and doesn't slow down the code at all.

answered Nov 05 '22 22:11

BeeOnRope

Tags:

performance

cpu-architecture

branch-prediction

branch

x86