I read that a perfectly predicted branch has zero or almost-zero overhead (for example, in an answer on "Effects of branch prediction on performance?").
I don't quite understand what people mean by this. At the very least, the branch condition has to be evaluated, and whether it is a simple bool or a function call, that evaluation takes time.
Evaluating a branch condition always takes some work, even if perfectly predicted, but because of the internal parallelism in modern CPUs, extra work doesn't necessarily add to the cost of a particular instruction sequence.
I think part of the confusion lies in the mental performance model many people have for the execution of CPU instructions. Yes, every instruction requires some work, so that should imply that every instruction has some cost, however small, when measured in execution time, right?
Well, that would be true if the total cost of execution were simply additive in the work for each instruction: you just add together all the work and get the final cost. Because of the large amount of parallelism in modern CPUs, it doesn't work like that.
Think of it like organizing a birthday party. You may have to buy flour, which takes 10 minutes, then bake a cake, which takes 60 minutes, and also go pick up a special gift which is 30 minutes away. Those timings are all the "work" required for the activity. However, someone can go pick up the gift while the flour is being bought and the cake is being baked. You can't bake the cake without the flour, however. So you have two dependency chains: the 70-minute buy flour -> bake cake chain, and the 30-minute pick up gift chain. With unlimited parallelism, only the 70-minute cake-related chain contributes to the time at which everything is ready. Picking up the gift takes 30 minutes of work but ends up costing no time (it doesn't delay completion of all the tasks), because other work that takes longer (aka the critical path) happens in parallel.
More extra tasks can be done in parallel until you run out of people to assign to them. (At that point, execution throughput limits start to increase latency; this is called a resource conflict.) If a resource conflict delays the critical path, rather than one of the shorter dependency chains, the "extra" work stops being free. CPUs don't know which dependency chain is or will be the critical path, so their scheduling doesn't prioritize it the way smart humans would in this planning analogy.
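To put the same idea into code, here is a minimal sketch (my own illustration, not from the original answer) of a loop with two independent dependency chains, one long and one short:

int two_chains(int a, int b, int n) {
    for (int i = 0; i < n; ++i) {
        a *= 111;  /* the "cake" chain: each multiply waits on the previous one,
                      ~3 cycles per link, so this is the critical path */
        b += 7;    /* the "gift" chain: an independent 1-cycle add that executes
                      in parallel and hides under the multiply latency */
    }
    return a + b;
}

On an out-of-order CPU, the adds should not lengthen the loop at all: each iteration is still limited by the ~3-cycle multiply chain.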
For a less abstract and more practical look at how this stuff applies directly to CPUs, see A Whirlwind Introduction to Dataflow Graphs.
Once we have this new mental model, where the cost of an instruction sequence is often dominated by some critical path through the sequence, we can start to see why well-predicted branches are often very low or zero cost:

- Branch instructions have no data output1, so they generally don't appear on any dependency chain: no later instruction waits on the result of a well-predicted branch.
- Branch instructions are short-latency, and on x86 the comparison and the branch can often fuse into a single operation2, so they rarely become a throughput bottleneck either.

Those factors combine to make most well-predicted branch instructions zero cost or nearly zero cost.
You don't have to take my word for it; let's look at a real example:
int mul1(int count, int x) {
    do {
        x *= 111;
    } while (--count);
    return x;
}
Given a count and a starting value x, it multiplies x by 111 count times and returns the result. The loop assembles to 3 instructions: one for the multiply, one for the --count, and a branch to check the count value:
.L2:
    imul eax, eax, 111
    sub  edi, 1
    jne  .L2
Now here's the same loop, but with an additional branch:
int mul2(int count, int x) {
    do {
        x *= 111;
        if (x == 0) {
            abort();
        }
    } while (--count);
    return x;
}
This assembles to 5 instructions. The extra two are the test of x and a branch taken if the test shows that x is zero:
.L7:
    imul eax, eax, 111
    test eax, eax
    je   .L12        ; ends up calling abort
    sub  edi, 1
    jne  .L7
So what is the cost of adding 67% more instructions, including a branch? Zero, at least to 4 significant digits3:
Running benchmarks groups using timer libpfc
** Running benchmark group stackoverflow tests **
Benchmark            Cycles
No branch             3.000
Added test-branch     3.000
The loop takes 3 cycles per iteration, because it is limited by the dependency chain involving the 3-cycle multiply. The additional instructions and the branch didn't cost anything, because they didn't add to this dependency chain and were able to execute "out of line", hiding behind the latency of the critical path.
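By contrast, here is a sketch of an extra instruction that would not be free (mul3 is a hypothetical variant of mine, not part of the benchmark above), because it inserts itself into the multiply dependency chain instead of running beside it:

int mul3(int count, int x) {
    do {
        x *= 111;
        x += 1;  /* reads the multiply result and feeds the next multiply,
                    lengthening the loop-carried chain to ~4 cycles
                    (3 for imul + 1 for add) per iteration */
    } while (--count);
    return x;
}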
1 Conceptually, branch instructions write the "rip" register, but this isn't treated like the other registers at all: its progression is predicted ahead of time, so the dependency is broken by the predictor.
2 Of course, there is still additional work to decode and fuse the instruction in the first place, but this is often not the bottleneck and so may be "free" in cost terms, and things like uop caches mean that it may not even be performed frequently. Also, on x86, while a fused branch instruction has the same latency as an ALU op, it is less flexible in terms of which ports it can execute on, so depending on port pressure a fused instruction may have some cost compared to the bare ALU op.
3 In fact, if you go to "infinite" significant digits and look at raw cycle counts, you see that additional iterations of this loop cost exactly 3 cycles in both cases. The no-branch case does usually end up 1 cycle shorter overall (a difference that goes to 0 in a relative sense as the iterations increase), perhaps because the initial non-steady-state iteration takes an additional cycle, or the misprediction recovery takes an additional cycle on the final iteration.
Branch prediction predicts the OUTCOME of your condition at the instruction level, which is the actual result of the C or C++ condition. If that result comes from a single million-character string comparison, the prediction probably isn't particularly useful, because the comparison itself is going to take a lot of time. If it's the end condition of the for loop that iterates over two strings with a million characters each, then it's highly useful, because the branch happens many times in that loop (assuming the strings are equal).
It is NOT free to make a string comparison of two long strings. It is free to guess correctly that the string comparison will continue (until we either find the end of the string or a difference, at which point the branch prediction goes wrong).
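To make that concrete, here is a minimal sketch (a hypothetical example of mine) of such a comparison loop:

#include <stddef.h>

/* Returns the length of the common prefix of two NUL-terminated strings.
   The loop-continue branch is correctly predicted on almost every
   iteration; the predictor is wrong only once, at the exit, when the
   strings finally end or differ. */
size_t common_prefix(const char *a, const char *b) {
    size_t i = 0;
    while (a[i] != '\0' && a[i] == b[i])  /* well-predicted: almost always true */
        ++i;
    return i;
}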
A "unpredictable" branch will lead to the processor not knowing where the code continues to. Modern CPUs have fairly long pipelines (15-30 steps), so if the pipeline isn't filled with the "right" code, the processor will have to wait for the "right" code to trickle through the pipeline.
So, to answer the actual question:
When the branch itself is well predicted, the processor already has the RIGHT instructions in the pipeline, and there is no "pipeline bubble" that has to walk through the pipeline before we can execute the correct instructions to continue the program. See below for an analogy. If the prediction is wrong, there will be something other than the right instructions in the pipeline, and the processor has to chew its way through those, throwing them away.
Think of it as a car factory, making cars of models A and B, in a production line that first mounts the body onto a chassis, paints it (magic paint, it dries almost instantly), then fits the engine and gearbox, and then puts wheels on it, fits the lights, and finally the glass is fitted, and it's a complete car. Each step takes 20 minutes to perform, and the conveyor belt for the cars will move forward to the next position every 20 minutes [for this exercise, we ignore the fact that the move itself takes time].
You are in charge of the production line, and you have a bunch of A cars on the production line. Suddenly, the big boss says, "we've just got an order for B cars, change to making model B immediately". So, you start feeding B car parts onto the production line. But it will still take quite some time before the next B car comes out at the other end.
The way branch prediction works is that it "guesses" whether the code will change direction or continue to the next instruction. If it guesses right, it's like guessing in advance when the big boss will come down to tell you to change between A and B models, so you can have the right car ready to pop off the production line when the boss wants it, rather than having to wait for the whole production line to empty of the old model and fill up with the new one.
When it works, it's great, because the expected stuff is ready to be done. If you guess wrong, you've still got to wait for the rest of the production line to run through the current set and stash those cars in the "we don't have a customer for these" corner (or, in terms of CPU instructions, "discard the instructions").
Most modern CPUs also allow for "speculative execution". That means the processor will start executing instructions BEFORE the condition is actually determined, so the processor will switch from A cars to working on B cars before the boss says so. If, at that point, the boss says "no, you should keep working on A cars", you have a bunch of cars to discard that you have already started work on. You don't necessarily have to finish building all of them, but you do have to move them through the production line, one step every 20 minutes.