What is a Partial Flag Stall?

Question

I was just going over this answer by Peter Cordes and he says,

Partial-flag stalls happen when flags are read, if they happen at all. P4 never has partial-flag stalls, because they never need to be merged. It has false dependencies instead. Several answers / comments mix up the terminology. They describe a false dependency, but then call it a partial-flag stall. It's a slowdown which happens because of writing only some of the flags, but the term "partial-flag stall" is what happens on pre-SnB Intel hardware when partial-flag writes have to be merged. Intel SnB-family CPUs insert an extra uop to merge flags without stalling. Nehalem and earlier stall for ~7 cycles. I'm not sure how big the penalty is on AMD CPUs.

I don't feel like I understand yet what a "partial flag stall" is. How do I know one has occurred? What triggers the event other than sometimes when flags are read? What does it mean to merge flags? In what condition are "some of the flags written" but a partial-flag merge doesn't happen? What do I need to know about flag stalls to understand them?

BeeOnRope · Accepted Answer

Generally speaking, a partial flag stall occurs when a flag-consuming instruction reads one or more flags that were not written by the most recent flag-setting instruction.

So an instruction like inc, which sets only some flags (it doesn't set CF), doesn't inherently cause a partial stall. It causes a stall only if a subsequent instruction reads the flag that inc did not set (CF) with no intervening instruction that sets CF. This also implies that instructions that write all of the relevant flags are never involved in partial stalls: when such an instruction is the most recent flag-setting instruction at the point a flag-reading instruction executes, it must have written the consumed flag.

So, in general, an algorithm for statically determining whether a partial flags stall will occur is to look at each instruction that uses the flags (generally the jcc family, cmovcc, and a few specialized instructions like adc), walk backwards to the nearest preceding instruction that sets any flag, and check whether it sets all of the flags read by the consuming instruction. If not, a partial flags stall will occur.
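The backwards walk described above can be sketched in a few lines of Python. This is only an illustrative model, not a tool: the function name `partial_flag_stalls` and the two lookup tables are made up for this sketch, and the tables cover just the handful of instructions used in this answer.

```python
# Illustrative model of the static check described above (names are hypothetical).
ALL_FLAGS = frozenset({"CF", "PF", "AF", "ZF", "SF", "OF"})

# Flags written by each flag-setting instruction (only the ones used in this answer).
WRITES = {
    "add": ALL_FLAGS,
    "inc": ALL_FLAGS - {"CF"},  # inc leaves CF untouched
    "dec": ALL_FLAGS - {"CF"},  # dec leaves CF untouched
}

# Flags read by each flag-consuming instruction (only the ones used in this answer).
READS = {
    "ja": frozenset({"CF", "ZF"}),
    "jc": frozenset({"CF"}),
    "jnz": frozenset({"ZF"}),
}


def partial_flag_stalls(program):
    """Return the indices of flag readers that would stall under the rule above."""
    stalls = []
    for i, op in enumerate(program):
        needed = READS.get(op)
        if not needed:
            continue  # not a flag consumer
        # Walk backwards to the nearest preceding instruction that writes any flag.
        for prev in reversed(program[:i]):
            written = WRITES.get(prev)
            if written:
                if not needed <= written:
                    stalls.append(i)  # some needed flag came from an older writer
                break
    return stalls


print(partial_flag_stalls(["add", "inc", "jc"]))   # inc is the nearest writer, jc needs CF
print(partial_flag_stalls(["add", "inc", "jnz"]))  # inc wrote ZF, so jnz is fine
```

A real checker would need complete flag tables for the whole instruction set and would have to follow control flow; this sketch only handles straight-line code.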

Later architectures, starting with Sandy Bridge, don't suffer a partial flags stall per se, but in some cases still pay a penalty in the form of an additional uop that the flag-consuming instruction adds in the front-end. The rules are slightly different and apply to a narrower set of cases than the stall discussed above. In particular, the so-called flag-merging uop is added only when a flag-consuming instruction reads multiple flags and those flags were last set by different instructions. This means, for example, that instructions that examine a single flag never cause a merging uop to be emitted.
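The merge-uop rule differs from the stall rule in that it tracks the last writer of each flag individually. A rough Python sketch of that rule follows, assuming straight-line code; `needs_merge_uop` and the `ADD`/`INC` tuples are hypothetical names invented for illustration.

```python
# Illustrative model of the Sandy Bridge-style merge-uop rule (names are hypothetical).
ALL_FLAGS = frozenset({"CF", "PF", "AF", "ZF", "SF", "OF"})
ADD = ("add", ALL_FLAGS)
INC = ("inc", ALL_FLAGS - {"CF"})  # inc leaves CF untouched


def needs_merge_uop(writers, flags_read):
    """True if the flags read were last written by more than one instruction.

    writers: list of (mnemonic, flags_written) in program order.
    flags_read: set of flags the consuming instruction reads.
    """
    last_writer = {}
    for index, (_mnemonic, flags_written) in enumerate(writers):
        for flag in flags_written:
            last_writer[flag] = index
    # A flag with no writer in the snippet maps to None, i.e. "some earlier instruction".
    producers = {last_writer.get(flag) for flag in flags_read}
    return len(producers) > 1


print(needs_merge_uop([ADD, INC], {"CF", "ZF"}))  # CF from add, ZF from inc
print(needs_merge_uop([ADD, INC], {"CF"}))        # a single flag has a single producer
```

Because a single flag always has exactly one last writer, the model reproduces the point above that single-flag consumers never need a merging uop.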

Starting from Skylake (and probably starting from Broadwell), I find no evidence of any merging uops. Instead, the uop format has been extended to take up to 3 inputs, meaning that the separately renamed carry flag and the renamed-together SPAZO group of flags can both be used as inputs to most instructions. Exceptions include instructions like cmovbe, which has two register inputs and whose condition be requires both the C flag and one or more of the SPAZO flags. Most conditional moves use only one or the other of the C and SPAZO flags, however, and take one uop.
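To see which conditions touch both renamed flag groups, the standard x86 condition codes can be mapped onto the C and SPAZO groups. The sketch below is a simplified model: the table lists the architectural flags each condition tests (aliases such as nbe or nae are omitted), and `flag_groups` is a hypothetical helper name.

```python
# Flags tested by each x86 condition code (aliases omitted for brevity).
COND_FLAGS = {
    "o": {"OF"}, "no": {"OF"},
    "b": {"CF"}, "ae": {"CF"},
    "e": {"ZF"}, "ne": {"ZF"},
    "be": {"CF", "ZF"}, "a": {"CF", "ZF"},
    "s": {"SF"}, "ns": {"SF"},
    "p": {"PF"}, "np": {"PF"},
    "l": {"SF", "OF"}, "ge": {"SF", "OF"},
    "le": {"ZF", "SF", "OF"}, "g": {"ZF", "SF", "OF"},
}

SPAZO = {"SF", "PF", "AF", "ZF", "OF"}


def flag_groups(cond):
    """Return which renamed flag groups ("C", "SPAZO") a condition code reads."""
    flags = COND_FLAGS[cond]
    groups = set()
    if "CF" in flags:
        groups.add("C")
    if flags & SPAZO:
        groups.add("SPAZO")
    return groups


# Conditions that need both groups as inputs.
both = sorted(c for c in COND_FLAGS if len(flag_groups(c)) == 2)
print(both)
```

In this table only the a and be conditions read both groups, which matches the cmovbe exception mentioned above.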

Examples

Here are some examples. We discuss both "[partial flag] stalls" and "merge uops", but as above only at most one of the two applies to any given architecture, so something like "The following causes a stall and a merge uop to be emitted" should be read as "The following causes a stall [on those older architectures which have partial flag stalls] or a merge uop [on those newer architectures which use merge uops instead]".

Stall and merging uop

The following example will cause a stall on the architectures that stall, and a merging uop to be emitted on Sandy Bridge and Ivy Bridge, but not on Skylake:

add rbx, 5   ; sets CF, ZF, others
inc rax      ; sets ZF, but not CF
ja  label    ; reads CF and ZF

The ja instruction reads CF and ZF which were last set by the add and inc instructions, respectively, so a merge uop is inserted to unify the separately set flags for consumption by ja. On architectures that stall, a stall occurs because ja reads from CF which was not set by the most recent flag setting instruction.

Stall only

add rbx, 5   ; sets CF, ZF, others
inc rax      ; sets ZF, but not CF
jc  label    ; reads CF

This causes a stall because, as in the prior example, the jc reads CF, which is not set by the most recent flag-setting instruction (here inc). In this case, the stall could be avoided by simply swapping the order of the inc and add: they are independent, and the jc would then read only from the most recent flag-setting operation. No merge uop is needed because the flags read (only CF) all come from the same add instruction.

Note: This case is under debate (see the comments), but I cannot test it because I find no evidence of any merging uops at all on my Skylake.

No stall or merging uop

add rbx, 5   ; sets CF, ZF, others
inc rax      ; sets ZF, but not CF
jnz  label   ; reads ZF

Here there is no stall or merging uop needed, even though the last instruction (inc) only sets some flags, because the consuming jnz only reads (a subset of) flags set by the inc and no others. So this common looping idiom (usually with dec instead of inc) doesn't inherently cause a problem.

Here's another example that doesn't cause any stall or merge uop:

inc rax      ; sets ZF, but not CF
add rbx, 5   ; sets CF, ZF, others
ja  label    ; reads CF and ZF

Here the ja does read both CF and ZF, and an inc is present which doesn't set CF (i.e., a partial-flag-writing instruction), but there is no problem because the add comes after the inc and writes all the relevant flags.

Shifts

The shift instructions sar, shr and shl, in both their variable and fixed count forms, behave differently (generally worse) than described above, and this varies a fair amount across architectures. This is probably due to their weird and inconsistent flag handling¹. For example, on many architectures there is something like a partial flags stall when reading any flag after a shift instruction with a count other than 1. Even on the most recent architectures, variable shifts have a significant cost of 3 uops due to flag handling (but there is no more "stall").

I'm not going to include all the gory details here, but I'd recommend looking for the word shift in Agner's microarch doc if you want all the details.

Some rotate instructions also have interesting flag related behavior in some cases similar to shifts.

¹ For example, setting different subsets of flags depending on whether the shift count is 0, 1 or some other value.
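The count-dependent behaviour in the footnote can be modelled roughly as follows. This is a sketch based on my reading of the architectural description of shl/shr/sar in Intel's instruction reference, not a statement about any particular microarchitecture. It is restricted to 32-bit and 64-bit operands, and `shift_flag_effects` is a hypothetical helper name.

```python
def shift_flag_effects(count, operand_bits=64):
    """Return (flags_written, flags_undefined) for shl/shr/sar with a given count.

    Assumption: only 32-bit and 64-bit operands are modelled, where the count
    is masked to 5 or 6 bits respectively before the shift takes place.
    """
    if operand_bits not in (32, 64):
        raise ValueError("this sketch models only 32-bit and 64-bit operands")
    masked = count & (operand_bits - 1)
    if masked == 0:
        # A zero count leaves every flag unchanged.
        return set(), set()
    written = {"CF", "PF", "ZF", "SF"}
    undefined = {"AF"}
    if masked == 1:
        written.add("OF")    # OF is only defined for 1-bit shifts
    else:
        undefined.add("OF")  # OF is architecturally undefined for larger counts
    return written, undefined


for n in (0, 1, 5):
    print(n, shift_flag_effects(n))
```

With a variable count in cl, the count is not known until execution, so which of these three cases applies is a run-time property of the instruction.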
