I was just going over this answer by Peter Cordes and he says,
Partial-flag stalls happen when flags are read, if they happen at all. P4 never has partial-flag stalls, because they never need to be merged. It has false dependencies instead. Several answers / comments mix up the terminology. They describe a false dependency, but then call it a partial-flag stall. It's a slowdown which happens because of writing only some of the flags, but the term "partial-flag stall" is what happens on pre-SnB Intel hardware when partial-flag writes have to be merged. Intel SnB-family CPUs insert an extra uop to merge flags without stalling. Nehalem and earlier stall for ~7 cycles. I'm not sure how big the penalty is on AMD CPUs.
I don't feel like I understand yet what a "partial flag stall" is. How do I know one has occurred? What triggers the event other than sometimes when flags are read? What does it mean to merge flags? In what condition are "some of the flags written" but a partial-flag merge doesn't happen? What do I need to know about flag stalls to understand them?
Generally speaking a partial flag stall occurs when a flag-consuming instruction reads one or more flags that were not written by the most recent flag-setting instruction.
So an instruction like inc that sets only some flags (it doesn't set CF) doesn't inherently cause a partial stall, but will cause a stall if a subsequent instruction reads the flag (CF) that was not set by inc (without any intervening instruction that sets the CF flag). This also implies that instructions that write all interesting flags are never involved in partial stalls, since when they are the most recent flag-setting instruction at the point a flag-reading instruction is executed, they must have written the consumed flag.
So, in general, an algorithm for statically determining whether a partial flags stall will occur is to look at each instruction that uses the flags (generally the jcc family and cmovcc and a few specialized instructions like adc) and then walk backwards to find the first instruction that sets any flag and check if it sets all of the flags read by the consuming instruction. If not, a partial flags stall will occur.
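The walk-backwards check described above can be sketched in a few lines of Python. This is only an illustration of the simplified rule, not a model of any real CPU: the function name find_partial_flag_stalls and the two small tables are made up for this example, they cover only a handful of mnemonics, and the shift and rotate special cases are ignored.

```python
# Illustrative, incomplete tables (an assumption of this sketch, not a full ISA model).
ALL_FLAGS = frozenset({"CF", "PF", "AF", "ZF", "SF", "OF"})

# Flags written by each instruction.
WRITES = {
    "add": ALL_FLAGS,
    "sub": ALL_FLAGS,
    "cmp": ALL_FLAGS,
    "adc": ALL_FLAGS,
    "inc": ALL_FLAGS - {"CF"},  # inc/dec leave CF untouched
    "dec": ALL_FLAGS - {"CF"},
}

# Flags read by each flag-consuming instruction.
READS = {
    "jc": frozenset({"CF"}),
    "jnz": frozenset({"ZF"}),
    "ja": frozenset({"CF", "ZF"}),
    "adc": frozenset({"CF"}),
}


def find_partial_flag_stalls(program):
    """Return the indices of flag readers that stall under the simple rule.

    `program` is a list of mnemonics in program order. A reader stalls when
    the most recent flag-setting instruction before it did not write every
    flag the reader consumes.
    """
    stalls = []
    last_written = None  # flags written by the most recent flag setter
    for index, mnemonic in enumerate(program):
        needed = READS.get(mnemonic, frozenset())
        if needed and last_written is not None and not needed <= last_written:
            stalls.append(index)
        written = WRITES.get(mnemonic, frozenset())
        if written:
            last_written = written  # this instruction is now the latest setter
    return stalls


if __name__ == "__main__":
    print(find_partial_flag_stalls(["add", "inc", "jc"]))   # jc reads CF, inc did not write it
    print(find_partial_flag_stalls(["add", "inc", "jnz"]))  # jnz reads only ZF, written by inc
```

Scanning forward while remembering the latest flag setter is equivalent to walking backwards from each consumer; it just avoids rescanning the program for every reader.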
Later architectures, starting with Sandy Bridge, don't suffer a partial flags stall per se, but in some cases still pay a penalty in the form of an additional uop that the flag-consuming instruction adds in the front-end. The rules are slightly different and apply to a narrower set of cases than the stall discussed above. In particular, the so-called flag merging uop is added only when a flag-consuming instruction reads multiple flags and those flags were last set by different instructions. This means, for example, that instructions that examine a single flag never cause a merging uop to be emitted.
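The merging rule can be sketched the same way: track, for each flag, which instruction wrote it last, and report a merge uop whenever a consumer's flags come from more than one writer. This is a minimal sketch of the rule as stated in the paragraph above, not a model of the hardware; the name find_merge_uops and the small tables are assumptions made for the example, and a flag with no writer inside the snippet is treated as coming from some earlier, separate instruction.

```python
# Illustrative, incomplete tables (an assumption of this sketch).
ALL_FLAGS = frozenset({"CF", "PF", "AF", "ZF", "SF", "OF"})

WRITES = {
    "add": ALL_FLAGS,
    "sub": ALL_FLAGS,
    "cmp": ALL_FLAGS,
    "inc": ALL_FLAGS - {"CF"},  # inc/dec leave CF untouched
    "dec": ALL_FLAGS - {"CF"},
}

READS = {
    "jc": frozenset({"CF"}),
    "jnz": frozenset({"ZF"}),
    "ja": frozenset({"CF", "ZF"}),
}


def find_merge_uops(program):
    """Return the indices of flag readers whose inputs come from several writers."""
    last_writer = {}  # flag name -> index of the instruction that last wrote it
    merges = []
    for index, mnemonic in enumerate(program):
        needed = READS.get(mnemonic, frozenset())
        # None stands for "written somewhere before this snippet".
        writers = {last_writer.get(flag) for flag in needed}
        if len(writers) > 1:
            merges.append(index)
        for flag in WRITES.get(mnemonic, frozenset()):
            last_writer[flag] = index
    return merges


if __name__ == "__main__":
    print(find_merge_uops(["add", "inc", "ja"]))  # CF from add, ZF from inc
    print(find_merge_uops(["add", "inc", "jc"]))  # only CF, single writer
```

A consumer that reads a single flag can never have more than one writer in this model, which matches the observation that single-flag readers never need a merging uop.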
Starting from Skylake (and probably starting from Broadwell), I find no evidence of any merging uops. Instead, the uop format has been extended to take up to 3 inputs, meaning that the separately renamed carry flag and the renamed-together SPAZO group flags can both be used as inputs to most instructions. Exceptions include instructions like cmovbe, which has two register inputs and whose condition be requires the use of both the C flag and one or more of the SPAZO flags. Most conditional moves use only one or the other of C and SPAZO flags, however, and take one uop.
Here are some examples. We discuss both "[partial flag] stalls" and "merge uops", but as above only at most one of the two applies to any given architecture, so something like "The following causes a stall and a merge uop to be emitted" should be read as "The following causes a stall [on those older architectures which have partial flag stalls] or a merge uop [on those newer architectures which use merge uops instead]".
The following example will cause a stall and merging uop to be emitted on Sandy Bridge and Ivy Bridge, but not on Skylake:
add rbx, 5 ; sets CF, ZF, others
inc rax ; sets ZF, but not CF
ja label ; reads CF and ZF
The ja instruction reads CF and ZF, which were last set by the add and inc instructions, respectively, so a merge uop is inserted to unify the separately set flags for consumption by ja. On architectures that stall, a stall occurs because ja reads from CF, which was not set by the most recent flag-setting instruction.
add rbx, 5 ; sets CF, ZF, others
inc rax ; sets ZF, but not CF
jc label ; reads CF
This causes a stall because, as in the prior example, CF is read, which is not set by the last flag-setting instruction (here inc). In this case, the stall could be avoided by simply swapping the order of the inc and add, since they are independent, and then the jc would read only from the most recent flag-setting operation. There is no merge uop needed because the flags read (only CF) all come from the same add instruction.
Note: This case is under debate (see the comments), but I cannot test it because I can't find evidence of any merging uops at all on my Skylake.
add rbx, 5 ; sets CF, ZF, others
inc rax ; sets ZF, but not CF
jnz label ; reads ZF
Here there is no stall or merging uop needed, even though the last instruction (inc) only sets some flags, because the consuming jnz only reads (a subset of) flags set by the inc and no others. So this common looping idiom (usually with dec instead of inc) doesn't inherently cause a problem.
Here's another example that doesn't cause any stall or merge uop:
inc rax ; sets ZF, but not CF
add rbx, 5 ; sets CF, ZF, others
ja label ; reads CF and ZF
Here the ja does read both CF and ZF, and an inc is present which doesn't set CF (i.e., a partial flag writing instruction), but there is no problem because the add comes after the inc and writes all the relevant flags.
The shift instructions sar, shr and shl, in both their variable and fixed count forms, behave differently (generally worse) than described above, and this varies a fair amount across architectures. This is probably due to their weird and inconsistent flag handling [1]. For example, on many architectures there is something like a partial flags stall when reading any flag after a shift instruction with a count other than 1. Even on the most recent architectures variable shifts have a significant cost of 3 uops due to flag handling (but there is no more "stall").
I'm not going to include all the gory details here, but I'd recommend looking for the word shift in Agner's microarch doc if you want all the details.
Some rotate instructions also have interesting flag related behavior in some cases similar to shifts.
[1] For example, setting different subsets of flags depending on whether the shift count is 0, 1 or some other value.