How do modern CPUs like Kaby Lake handle small branches? (In the code below it is the jump to the label .LBB1_67.) From what I know, the branch should not be harmful because the jump distance is smaller than the 16-byte fetch block, which is the size of the decode window.
Or is it possible that, due to some macro-op fusion, the branch will be completely elided?
sbb rdx, qword ptr [rbx - 8]
setb r8b
setl r9b
mov rdi, qword ptr [rbx]
mov rsi, qword ptr [rbx + 8]
vmovdqu xmm0, xmmword ptr [rbx + 16]
cmp cl, 18
je .LBB1_67
mov r9d, r8d
.LBB1_67: # in Loop: Header=BB1_63 Depth=1
vpcmpeqb xmm0, xmm0, xmmword ptr [rbx - 16]
vpmovmskb ecx, xmm0
cmp ecx, 65535
sete cl
cmp rdi, qword ptr [rbx - 32]
sbb rsi, qword ptr [rbx - 24]
setb dl
and dl, cl
or dl, r9b
There are no special cases for short branch distances in any x86 CPUs. Even an unconditional jmp to the next instruction (architecturally a nop) needs correct branch prediction to be handled efficiently; if you put enough of those in a row, you run out of BTB entries and performance falls off a cliff. See Slow jmp-instruction.
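A minimal sketch of the kind of microbenchmark that Q&A is about (my construction, assuming GAS/clang-style assembly with .intel_syntax like the listing above): a long run of jumps that each target the very next instruction.
.rept 4096
jmp 1f # architecturally a nop, but each one still consumes a BTB entry
1:
.endr
With a handful of these, prediction is essentially free; with thousands in a row, BTB capacity is exhausted and throughput drops sharply.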
Fetch/decode is only a minor problem; yes, a very short branch within the same cache line will still hit in L1i and probably in the uop cache. But it's unlikely that the decoders would special-case a predicted-taken forward jump and make use of pre-decode instruction-boundary finding from one block that included both the branch and the target.
When an instruction is being decoded to uops and fed into the front-end, register values aren't available; those only become available in the out-of-order execution back-end.
The major problem is that when the instructions after .LBB1_67: execute, the architectural state is different depending on whether the branch was taken or not. And so is the micro-architectural state (the RAT, Register Alias Table). Either:

- r9 depends on the sbb/setl result (mov r9d, r8d didn't run), or
- r9 depends on the sbb/setb result (mov r9d, r8d did run).

Conditional branches are called "control dependencies" in computer-architecture terminology. Branch prediction plus speculative execution avoids turning control dependencies into data dependencies. If the je was predicted not-taken, the setl result (the old value of r9) is overwritten by the mov and is no longer available anywhere.

There's no way to recover from this after detecting a misprediction in the je (which actually should have been taken), certainly not in the general case. Current x86 CPUs don't try to look for the fall-through path rejoining the taken path, or to figure out anything about what it does.
If cl wasn't ready for a long time, so the mispredict wasn't discovered for a long time, many instructions after the or dl, r9b could have executed using the wrong inputs. In the general case, the only way to reliably and efficiently recover is to discard all work done on instructions from the "wrong" path. Detecting that vpcmpeqb xmm0, xmm0, [rbx - 16], for example, still runs either way is hard, and not looked for. (Modern Intel, since Sandybridge, has a Branch Order Buffer (BOB) that snapshots the RAT on branches, allowing efficient rollback to the branch miss as soon as execution detects it, while still allowing out-of-order execution on earlier instructions to continue during the rollback. Before that, a branch miss had to roll back to the retirement state.)
Some CPUs for some non-x86 ISAs (e.g. PowerPC, I think) have experimented with turning forward branches that skip exactly one instruction into predication (a data dependency) instead of speculating past them. E.g. the paper Dynamic Hammock Predication for Non-predicated Instruction Set Architectures discusses this idea, and even deciding whether to predicate or not on a per-branch basis: if your branch-prediction history says this branch predicts poorly, predicating it instead could be good. (A hammock branch is one that jumps forward over one or a couple of instructions. Detecting the exactly-one-instruction case is trivial in an ISA with fixed-width instruction words, like a RISC, but hard on x86.)
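For reference, the branch from the question is itself a one-instruction hammock (comments mine):
cmp cl, 18
je .LBB1_67 # forward branch over exactly one instruction: a "hammock"
mov r9d, r8d # the single instruction that may be skipped
.LBB1_67: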
In this case, x86 has a cmovcc instruction, an ALU select operation that produces one of its two inputs depending on a flag condition. cmovne r9d, r8d instead of the je/mov pair (keeping the cmp to set flags) would make this immune to branch mispredictions, but at the cost of introducing a data dependency on cl and r8d for instructions that use r9d. Intel CPUs don't try to do this for you.
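As a concrete sketch (my rewrite of the question's sequence, not actual compiler output):
cmp cl, 18 # still needed to set ZF
cmovne r9d, r8d # r9d = r8d only when cl != 18; replaces je .LBB1_67 / mov r9d, r8d
Now the later or dl, r9b always has a data dependency on the compare and on r8d, instead of a control dependency that correct prediction can hide.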
(On Broadwell and later Intel, cmov is only 1 uop, down from 2. cmp/jcc macro-fuses into 1 uop, and the mov itself is also 1 uop, so in the not-taken case cmov costs the front-end no more uops. And in the taken case, a taken branch can introduce bubbles in the pipeline even if predicted correctly, depending on how high-throughput the code is: whether queues between stages can absorb it.)
See gcc optimization flag -O3 makes code slower than -O2 for a case where cmov is slower than a branch, because introducing a data dependency lengthens a loop-carried critical path that branch prediction would otherwise break.
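A minimal sketch of that failure mode (a hypothetical loop of my construction, not the code from that question), assuming rdi = data, rsi = 0, r8 = element count, and eax = sum = 0 on entry:
.Lsum_loop:
mov ecx, dword ptr [rdi + 4*rsi] # load data[i]
lea edx, [rax + rcx] # candidate value: sum + data[i]
cmp ecx, 128
cmovge eax, edx # sum = (data[i] >= 128) ? sum + data[i] : sum
inc rsi
cmp rsi, r8
jb .Lsum_loop
Each new value of eax depends on the cmovge, which waits for this iteration's load and compare, so iterations serialize on roughly lea + cmov latency. With a well-predicted branch instead, the adds of consecutive iterations can execute speculatively without waiting for the compares.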