Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How are branch mispredictions handled before a hardware interrupt

A hardware interrupt occurs to a particular vector (not masked), CPU checks IF flag and pushes RFLAGS, CS and RIP to the stack, meanwhile there are still instructions completing in the back end, one of these instruction's branch predictions turns out to be wrong. Usually the pipeline would be flushed and the front end starts fetching from the correct address but in this scenario an interrupt is in progress.

When an interrupt occurs, what happens to instructions in the pipeline?

I have read this and clearly a solution is to immediately flush everything from the pipeline so that this doesn't occur and then generate the instructions to push the RFLAGS, CS, RIP to the location of the kernel stack in the TSS; however, the question arises, how does it know the (CS:)RIP associated with the most recent architectural state in order to be able to push it on the stack (given that the front end RIP would now be ahead). This is similar to the question of how the taken branch execution unit on port0 knows the (CS:)RIP of what should have been fetched when the take prediciton turns out to be wrong -- is the address encoded into the instruction as well as the prediction? The same issue arises when you think of a trap / exception, the CPU needs to push the address of the current instruction (fault) or the next instruction (trap) to the kernel stack, but how does it work out the address of this instruction when it is halfway down the pipeline -- this leads me to believe that the address must be encoded into the instruction and is worked out using the length information and this is possibly all done at predecode stage..

like image 502
Lewis Kelsey Avatar asked Mar 04 '23 15:03

Lewis Kelsey


2 Answers

The CPU will presumably discard the contents of the ROB, rolling back to the latest retirement state before servicing the interrupt.

An in-flight branch miss doesn't change this. Depending on the CPU (older / simpler), it might have already been in the process of rolling back to retirement state and flushing because of a branch miss, when the interrupt arrived.

As @Hadi says, the CPU could choose at that point to retire the branch (with the interrupt pushing a CS:RIP pointing to the correct branch target), instead of leaving it to be re-executed after returning from the interrupt.

But that only works if the branch instruction was already ready to retire: there were no instructions older than the branch still not executed. Since it's important to discover branch misses as early as possible, I assume branch recovery starts when it discovers a mispredict during execution, not waiting until it reaches retirement. (This is unlike other kinds of faults: e.g. Meltdown and L1TF are based on a faulting load not triggering #PF fault handling until it reaches retirement so the CPU is sure there really is a fault on the true path of execution. You don't want to start an expensive pipeline flush until you're sure it wasn't in the shadow of a mispredict or earlier fault.)

But since branch misses don't take an exception, redirecting the front-end can start early before we're sure that the branch instruction is part of the right path in the first place.

e.g. cmp byte [cache_miss_load], 123 / je mispredicts but won't be discovered for a long time. Then in the shadow of that mispredict, a cmp eax, 1 / je on the "wrong" path runs and a mispredict is discovered for it. With fast recovery, uops past that are flushed and fetch/decode/exec from the "right" path can start before the earlier mispredict is even discovered.


To keep IRQ latency low, CPUs don't tend to give in-flight instructions extra time to retire. Also, any retired stores that still have their data in the store buffer (not yet committed to L1d) have to commit before any stores by the interrupt handler can commit. But interrupts are serializing (I think), and any MMIO or port-IO in a handler will probably involve a memory barrier or strongly-ordered store, so letting more instructions retire can hurt IRQ latency if they involve stores. (Once a store retires, it definitely needs to happen even while its data is still in the store buffer).


The out-of-order back-end always knows how to roll back to a known-good retirement state; the entire contents of the ROB are always considered speculative because any load or store could fault, and so can many other instructions1. Speculation past branches isn't super-special.

Branches are only special in having extra tracking for fast recovery (the Branch Order Buffer in Nehalem and newer) because they're expected to mispredict with non-negligible frequency during normal operation. See What exactly happens when a skylake CPU mispredicts a branch? for some details. Especially David Kanter's quote:

Nehalem enhanced the recovery from branch mispredictions, which has been carried over into Sandy Bridge. Once a branch misprediction is discovered, the core is able to restart decoding as soon as the correct path is known, at the same time that the out-of-order machine is clearing out uops from the wrongly speculated path. Previously, the decoding would not resume until the pipeline was fully flushed.


(This answer is intentionally very Intel-centric because you tagged it intel, not x86. I assume AMD does something similar, and probably most out-of-order uarches for other ISAs are broadly similar. Except that memory-order mis-speculation isn't a thing on CPUs with a weaker memory model where CPUs are allowed to visibly reorder loads.)


Footnote 1: So can div, or any FPU instruction if FP exceptions are unmasked. And a denormal FP result could require a microcode assist to handle, even with FP exceptions masked like they are by default.

On Intel CPUs, a memory-order mis-speculation can also result in a pipeline nuke (load speculatively done early, before earlier loads complete, but the cache lost its copy of the line before the x86 memory model said the load could take its value).

like image 184
Peter Cordes Avatar answered Mar 07 '23 09:03

Peter Cordes


In general, each entry in the the ReOrder Buffer (ROB) has a field that is used to store enough information about the instruction address to reconstruct the whole instruction address unambiguously. It may be too costly to store the whole address for each instruction in the ROB. For instructions that have not yet been allocated (i.e., not yet passed the allocation stage of the pipeline), they need to carry this information with them at least until they reach the allocation stage.

If an interrupt and a branch misprediction occur at the same time, the proessor may, for example, choose to service the interrupt. In this case, all the instructions that are on the mispredicted path need to be flushed. The processor may choose also to flush other instructions that are on the correct path, but have not yet retired. All of these instructions are in the ROB and their instruction addresses are known. For each speculated branch, there is a tag that identifies all instructions on that speculated path and all instructions on this path are tagged with it. If there is another, later speculated branch, another tag is used, but it is also ordered with respect to the previous tag. Using these tags, the processor can determine exactly which instructions to flush when any of the speculated branches turns out to be incorrect. This is determined after the corresponding branch instruction completes execution in the branch execution unit. Branches may complete execution out of order. When the correct address of a msipredicted branch is calculated, it's forwarded to the fetch unit and the branch prediction unit (BPU). The fetch unit uses it to fetch instructions from the correct path and the BPU uses it to update its prediction state.

The processor can choose to retire the mispredicted branch instruction itself and flush all other later instructions. All rename registers are reclaimed and those physical registers that are mapped to architectural registers at the point the branch is retired are retained. At this point, the processor executes instructions to save the current state and then begins fetching instructions of the interrupt handler.

like image 36
Hadi Brais Avatar answered Mar 07 '23 10:03

Hadi Brais