
Is performance reduced when executing loops whose uop count is not a multiple of processor width?

I'm wondering how loops of various sizes perform on recent x86 processors, as a function of number of uops.

Here's a quote from Peter Cordes who raised the issue of non-multiple-of-4 counts in another question:

I also found that the uop bandwidth out of the loop buffer isn't a constant 4 per cycle, if the loop isn't a multiple of 4 uops. (i.e. it's abc, abc, ...; not abca, bcab, ...). Agner Fog's microarch doc unfortunately wasn't clear on this limitation of the loop buffer.

The issue is about whether loops need to be a multiple of N uops to execute at maximum uop throughput, where N is the width of the processor (i.e., 4 for recent Intel processors). There are a lot of complicating factors when talking about "width" and counting uops, but I mostly want to ignore those. In particular, assume no micro or macro-fusion.

Peter gives the following example of a loop with 7 uops in its body:

A 7-uop loop will issue groups of 4|3|4|3|... I haven't tested larger loops (that don't fit in the loop buffer) to see if it's possible for the first instruction from the next iteration to issue in the same group as the taken-branch to it, but I assume not.

More generally, the claim is that each iteration of a loop with x uops in its body will take at least ceil(x / 4) cycles, rather than simply x / 4.
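
For concreteness (an illustration I'm adding, not from Peter's comment), a 7-uop loop under the question's simplifications, counting each instruction as one uop and assuming no fusion, might look like this:

; Hypothetical 7-uop loop: 7 instructions, each assumed to be a single uop.
loop7:
    add rax, 1
    add rbx, 1
    add rcx, 1
    add rdx, 1
    add rsi, 1
    dec rbp
    jnz loop7
; ceil(7/4) = 2 cycles per iteration, i.e. 3.5 uops/cycle instead of the full 4.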

Is this true for some or all recent x86-compatible processors?

asked Sep 03 '16 by BeeOnRope


2 Answers

This is a follow-on to the original answer, to analyze the behavior for five additional architectures, based on test results provided by Andreas Abel:

  • Nehalem
  • Sandy Bridge
  • Ivy Bridge
  • Broadwell
  • Coffee Lake

We take a quick look at the results on these architectures in addition to Skylake and Haswell. It only needs to be a "quick" look since all the architectures except Nehalem follow one of the existing patterns discussed above.

First, the short nop case which exercises the legacy decoder (for loops that don't fit in the LSD) and the LSD. Here is the cycles/iteration for this scenario, for all 7 architectures.

Figure 2.1: All architectures dense nop performance:


This graph is really busy and a bit hard to read since the results for many architectures lie on top of each other, but I tried to ensure that a dedicated reader can track the line for any architecture.

First, let's discuss the big outlier: Nehalem. All of the other architectures have a slope that roughly follows the 4 uops/cycle line, but Nehalem is at almost exactly 3 uops per cycle, so quickly falls behind all of the other architectures. Outside of the initial LSD region, the line is also totally smooth, without the "stair step" appearance seen in the other architectures.

This is entirely consistent with Nehalem having a uop retirement limit of 3 uops/cycle. This is the bottleneck for uops outside of the LSD: they all execute at almost exactly 3 uops per cycle, bottlenecked on retirement. The front-end isn't the bottleneck, so the exact uop count and decoding arrangement doesn't matter, and the stair-step is absent.

Other than Nehalem, the remaining architectures, except Broadwell, split fairly cleanly into groups: Haswell-like or Skylake-like. That is, all of Sandy Bridge, Ivy Bridge and Haswell behave like Haswell for loops greater than about 15 uops (Haswell behavior is discussed in the other answer). Even though they are different micro-architectures, they behave largely the same since their legacy decoding capabilities are the same. Below about 15 uops we see Haswell as somewhat faster for any uop count not a multiple of 4. Perhaps it gets an additional unrolling in the LSD due to a larger LSD, or there are other "small loop" optimizations. For Sandy Bridge and Ivy Bridge, this means that small loops should definitely target a uop count which is a multiple of 4.
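
As a sketch of what that advice means in practice (my illustration, not from the answer; in the style of the nop tests used for these measurements, counting each instruction as one uop and ignoring fusion): when unrolling a small loop on Sandy Bridge or Ivy Bridge, prefer an unroll factor that lands the total uop count on a multiple of 4.

; Assume 2 uops of "work" per element plus 2 uops of loop overhead (dec + jnz).
; Not unrolled: 2 + 2 = 4 uops -> 1 cycle for 1 element            = 1.00 cycles/element
; Unrolled x2:  4 + 2 = 6 uops -> ceil(6/4) = 2 cycles, 2 elements = 1.00 cycles/element (no gain)
; Unrolled x3:  6 + 2 = 8 uops -> ceil(8/4) = 2 cycles, 3 elements = 0.67 cycles/element
; The x3 version wins by more than the saved loop overhead alone would explain,
; because 8 is a multiple of 4 and no issue slots are wasted.
align 32
unrolled3:
    times 6 nop         ; 3 elements x 2 uops of stand-in "work"
    dec rdi
    jnz unrolled3

This is essentially the shape of the Intel manual example discussed below, where unrolling by two doesn't help but unrolling by three does.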

Coffee Lake behaves similarly to Skylake[1]. This makes sense, since the micro-architecture is the same. Coffee Lake appears better than Skylake below about 16 uops, but this is just an effect of Coffee Lake having the LSD disabled by default. Skylake was tested with the LSD enabled, before Intel disabled it via microcode update due to a security issue. Coffee Lake was released after this issue was known, so had the LSD disabled out-of-the-box. So for this test, Coffee Lake is using either the DSB (for loops below about 18 uops, which can still fit in the DSB) or the legacy decoder (for the remainder of the loops), which leads to better results for small uop-count loops where the LSD imposes an overhead (interestingly, for larger loops, the LSD and the legacy decoder happen to impose exactly the same overhead, for very different reasons).

Finally, we take a look at 2-byte NOPs, which aren't dense enough to prevent the use of the DSB (so this case is more reflective of typical code).

Figure 2.2: 2-byte nop performance:


Again, the result is along the same lines as the earlier chart. Nehalem is still the outlier, bottlenecked at 3 uops per cycle. For the range up to about 60 uops, all architectures other than Coffee Lake are using the LSD, and we see that Sandy Bridge and Ivy Bridge perform a bit worse here, rounding up to the next cycle and so only achieving the maximum throughput of 4 uops/cycle if the number of uops in the loop is a multiple of 4. Above 32 uops the "unrolling" feature of Haswell and newer uarchs doesn't have any effect, so everything is roughly tied.

Sandy Bridge actually has a few uop ranges (e.g., from 36 through 44 uops) where it performs better than the newer architectures. This seems to occur because not all loops are detected by the LSD, and in these ranges the loops are served from the DSB instead. The DSB is generally faster, so Sandy Bridge is too in these cases.

What Intel Says

You can actually find a section specifically dealing with this topic in the Intel Optimization Manual, section 3.4.2.5, as pointed out by Andreas Abel in the comments. There, Intel says:

The LSD holds micro-ops that construct small “infinite” loops. Micro-ops from the LSD are allocated in the out-of-order engine. The loop in the LSD ends with a taken branch to the beginning of the loop. The taken branch at the end of the loop is always the last micro-op allocated in the cycle. The instruction at the beginning of the loop is always allocated at the next cycle. If the code performance is bound by front end bandwidth, unused allocation slots result in a bubble in allocation, and can cause performance degradation. Allocation bandwidth in Intel microarchitecture code name Sandy Bridge is four micro-ops per cycle. Performance is best, when the number of micro-ops in the LSD result in the least number of unused allocation slots. You can use loop unrolling to control the number of micro-ops that are in the LSD.

They go on to show an example where unrolling a loop by a factor of two doesn't help performance due to LSD "rounding", but unrolling by three works. The example is a bit confusing since it actually mixes two effects: unrolling more also reduces the loop overhead and hence the number of uops per iteration. A more interesting example would have been one where unrolling the loop fewer times led to an increase in performance due to LSD rounding effects.

This section seems to accurately describe the behavior in Sandy Bridge and Ivy Bridge. The results above show that both of these architectures do as described, and you lose 1, 2 or 3 uop execution slots for loops with 4N+3, 4N+2, or 4N+1 uops respectively.
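
For example (numbers mine, just applying that rule): a 9-uop loop (4N+1) issues as 4|4|1, wasting 3 slots in the last group, so it takes 3 cycles per iteration, i.e. 3 uops/cycle rather than the 4 uops/cycle the machine could otherwise sustain.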

It hasn't been updated to cover Haswell and later, however. As described in the other answer, performance on those uarchs has improved over the simple model described above and the behavior is more complex.


[1] There is a weird outlier at 16 uops where Coffee Lake performs worse than all the other architectures, even Nehalem (a regression of about 50%), but maybe this is measurement noise?

answered Oct 16 '22 by BeeOnRope


TL;DR: Tight loops consisting of exactly 7 uops result in inefficient retirement bandwidth utilization. Consider manual loop unrolling so that the loop consists of 12 uops.


I recently faced retirement bandwidth degradation with loops consisting of 7 uops. After doing some research on my own, quick googling led me to this topic. Here are my 2 cents, as they apply to a Kaby Lake i7-8550U CPU:

As @BeeOnRope noted, the LSD is turned off on chips like the KbL i7-8550U.

Consider the following NASM macro:

;rdi = 1L << 31 (iteration count, set up by the caller before the loop)
%macro nops 1
    align 32
    %%loop:
    times %1 nop        ; %1 one-byte nops
    dec rdi
    ja %%loop           ; dec does not modify CF, so (with CF clear) this loops until rdi reaches 0
%endmacro
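
For reference, here is a minimal harness sketch (mine, not from the original post, assuming a standalone Linux/NASM build) showing how the macro might be instantiated; each instantiation gives a loop of %1 nops plus the dec/ja pair, so nops 5 would be the 7-uop case discussed below, assuming dec and ja do not macro-fuse:

global _start
section .text
_start:
    mov rdi, 1 << 31        ; iteration count the macro expects in rdi
    nops 5                  ; 5 nops + dec + ja: presumably the 7-uop loop
    mov eax, 60             ; exit(0) via the Linux exit syscall
    xor edi, edi
    syscall
; The counters discussed below (uops_retired.*, idq.*) could then be collected
; for this binary, e.g. with perf stat; the exact method isn't shown in the post.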

Here is what the "average retirement rate", uops_retired.retire_slots / uops_retired.total_cycle, looks like:

[chart: average retirement rate vs. loop uop count]

The thing to notice here is the retirement degradation when the loop consists of 7 uops. It results in 3.5 uops being retired per cycle.

The average IDQ delivery rate, idq.all_dsb_cycles_any_uops / idq.dsb_cycles, looks like this:

[chart: average IDQ delivery rate vs. loop uop count]

For loops of 7 uops it results in 3.5 uops being delivered to the IDQ per cycle. Judging by this counter alone, it is impossible to conclude whether the uop cache delivers 4|3 or 6|1 groups.

For loops consisting of 6 uops, the uop cache bandwidth is utilized efficiently: 6 uops/cycle. When the IDQ fills up, the uop cache stays idle until it can deliver 6 uops again.

To check whether the uop cache stays idle, let's compare idq.all_dsb_cycles_any_uops and cycles:

[chart: idq.all_dsb_cycles_any_uops vs. cycles, by loop uop count]

For loops of 7 uops, the number of cycles in which uops are delivered to the IDQ is equal to the total number of cycles. By contrast, the counters are noticeably different for the loop of 6 uops.

The key counters to check are idq_uops_not_delivered.*:

[chart: idq_uops_not_delivered counters vs. loop uop count]

As can be seen, for the loop of 7 uops the renamer takes 4|3 groups, which results in inefficient retirement bandwidth utilization.
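
To connect this back to the TL;DR (a sketch of mine, not from the original answer): unrolling the body once so the loop totals 12 uops, a multiple of 4, should let the renamer allocate full 4|4|4 groups instead of 4|3. With the macro above (and again assuming dec and ja do not macro-fuse):

    mov rdi, 1 << 31
    nops 10           ; 10 nops + dec + ja = 12 uops per iteration;
                      ; each iteration now covers two copies of the 5-nop body,
                      ; so all 4 allocation slots can be used every cycle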

answered Oct 15 '22 by St.Antario