Why is a conditional move not vulnerable for Branch Prediction Failure?

After reading this post (answer on StackOverflow) (at the optimization section), I was wondering why conditional moves are not vulnerable to branch prediction failure. I found an article on conditional moves here (PDF by AMD). There, too, they claim a performance advantage for conditional moves. But why is this? I don't see it. At the moment that ASM instruction is evaluated, the result of the preceding CMP instruction is not known yet.

asked Jan 02 '13 23:01

Martijn Courteaux

1 Answer

Mis-predicted branches are expensive

A modern processor generally executes between one and three instructions each cycle if things go well (that is, if it does not stall waiting for the data these instructions depend on to arrive from previous instructions or from memory).

The statement above holds surprisingly well for tight loops, but this shouldn't blind you to one additional dependency that can prevent an instruction from being executed when its cycle comes: for an instruction to be executed, the processor must have started to fetch and decode it 15-20 cycles earlier.

What should the processor do when it encounters a branch? Fetching and decoding both targets does not scale (if more branches follow, an exponential number of paths would have to be fetched in parallel). So the processor only fetches and decodes one of the two branches, speculatively.

This is why mis-predicted branches are expensive: they cost the 15-20 cycles that are usually invisible because of an efficient instruction pipeline.
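To put a rough number on this, here is a minimal sketch in C of the average cost per branch. It assumes a fixed flush penalty for every mis-prediction, which is a simplification. The function name and the idea of a single mis-prediction rate are illustrative assumptions, not something measured on a particular processor.

```c
/* Hypothetical helper: average cycles lost per executed branch,
   assuming every mis-prediction costs the same fixed penalty. */
static double expected_branch_penalty(double mispredict_rate,
                                      double penalty_cycles) {
    /* A correctly predicted branch is treated as free in this model. */
    return mispredict_rate * penalty_cycles;
}
```

Under this model, a branch that is mis-predicted half the time with a 20-cycle penalty loses about 10 cycles per execution on average, while a branch that is always predicted correctly loses nothing.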

Conditional move is never very expensive

Conditional move does not require prediction, so it can never have this penalty. It has data dependencies, same as ordinary instructions. In fact, a conditional move has more data dependencies than ordinary instructions, because the data dependencies include both the “condition true” and “condition false” cases. After an instruction that conditionally moves r1 to r2, the contents of r2 seem to depend on both the previous value of r2 and on r1. A well-predicted conditional branch allows the processor to infer more accurate dependencies. But data dependencies typically take one or two cycles to arrive, if they need time to arrive at all.
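The difference in dependencies can be sketched at the source level in C. This is only an illustration: the function names are made up, and a compiler is free to turn either form into a branch or a conditional move. The second function shows the data-flow shape of a conditional move, where the result is computed from the condition and both candidate values.

```c
#include <stdint.h>

/* Branchy form: control flow decides which value is returned,
   so only one of a or b is needed once the branch is resolved. */
static uint32_t select_branchy(int cond, uint32_t a, uint32_t b) {
    if (cond) {
        return a;
    }
    return b;
}

/* Branchless form: the result is computed from cond, a and b together,
   so it depends on all three inputs, much like a conditional move. */
static uint32_t select_branchless(int cond, uint32_t a, uint32_t b) {
    /* mask is all ones when cond is non-zero, all zeros otherwise */
    uint32_t mask = (uint32_t)0 - (uint32_t)(cond != 0);
    return (a & mask) | (b & ~mask);
}
```

Both functions return the same value for the same inputs. The branchless one cannot start producing its result until a, b and cond are all available, but it has no branch that could be mis-predicted.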

Note that a conditional move from memory to register would sometimes be a dangerous bet: if the condition is such that the value read from memory is not assigned to the register, you have waited on memory for nothing. But the conditional move instructions offered in instruction sets are typically register to register, preventing this mistake on the part of the programmer.
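The memory case can be sketched the same way in C. Again, the names are hypothetical and the compiler decides what instructions are actually emitted. The point is only that the branchless form reads memory whether or not the value ends up being used.

```c
#include <stdint.h>

/* Branchy form: memory is read only when cond holds. */
static uint32_t load_if_branchy(int cond, const uint32_t *p,
                                uint32_t fallback) {
    if (cond) {
        return *p;
    }
    return fallback;
}

/* Branchless form: the load always happens, so its latency is paid
   even when the loaded value is discarded. Note that p must be valid
   to dereference even when cond is zero. */
static uint32_t load_if_branchless(int cond, const uint32_t *p,
                                   uint32_t fallback) {
    uint32_t loaded = *p;
    /* mask is all ones when cond is non-zero, all zeros otherwise */
    uint32_t mask = (uint32_t)0 - (uint32_t)(cond != 0);
    return (loaded & mask) | (fallback & ~mask);
}
```

If the load misses the cache and cond turns out to be false, the branchless version has waited on memory for nothing, which is the bad bet described above.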

answered Oct 23 '22 22:10

Pascal Cuoq

Tags:

performance

cpu-architecture

branch-prediction

optimization

assembly