
Why does this loop take 1.32 cycles per iteration?


Consider this simple C++ function to calculate the prefix sum of an array:

#include <cstdint>
#include <cstddef>

// Running prefix sum: output[i] = input[0] + input[1] + ... + input[i]
void prefix_sum(const uint32_t* input, uint32_t* output, size_t size) {
    uint32_t total = 0;
    for (size_t i = 0; i < size; i++) {
        total += input[i];
        output[i] = total;
    }
}

The loop compiles to the following assembly on gcc 5.5:

.L5:
        add     ecx, DWORD PTR [rdi+rax*4]
        mov     DWORD PTR [rsi+rax*4], ecx
        add     rax, 1
        cmp     rdx, rax
        jne     .L5

I don't see anything that would prevent this from running at 1 cycle per iteration, yet I consistently measure it at 1.32 (+/- 0.01) cycles/iteration on my Skylake i7-6700HQ, when running it against 8 KiB input/output arrays.
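
(For context on how such a number might be obtained, here is a minimal measurement sketch — my code, not the question's actual setup. It uses __rdtsc, which counts reference cycles at the nominal frequency, so it only lines up with a true cycles/iteration figure when the core runs at that frequency; the question's own numbers come from hardware performance counters such as uops_issued.any, which is the more precise route.)

#include <cstdint>
#include <cstddef>
#include <cstdio>
#include <vector>
#include <x86intrin.h>   // __rdtsc (GCC/Clang on x86)

// prefix_sum() as shown in the question; paste it here or link against it.
void prefix_sum(const uint32_t* input, uint32_t* output, size_t size);

int main() {
    constexpr size_t N = 8 * 1024 / sizeof(uint32_t);   // 8 KiB of uint32_t
    std::vector<uint32_t> in(N, 1), out(N);
    constexpr int reps = 100000;

    // Warm up the caches and let the core settle into a steady frequency.
    for (int i = 0; i < 1000; i++)
        prefix_sum(in.data(), out.data(), N);

    uint64_t t0 = __rdtsc();
    for (int i = 0; i < reps; i++)
        prefix_sum(in.data(), out.data(), N);
    uint64_t t1 = __rdtsc();

    std::printf("%.3f ref-cycles per element\n",
                double(t1 - t0) / (double(reps) * N));
    return int(out[N - 1] & 1);   // keep the result observable
}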

The loop is served out of the uop cache, doesn't cross any uop cache boundary, and performance counters don't indicate any front-end bottleneck.

It's 4 fused uops1, and this CPU can sustain 4 fused ops/cycle.

There are carried dependency chains through ecx and rax, each of 1 cycle, but these add uops can go to any of the 4 ALU ports, so they seem unlikely to conflict. The fused cmp/jne needs to go to p6, which is more of a concern, but I measure only 1.1 uops/iteration to p6. That would explain 1.1 cycles per iteration, but not the 1.32 I measure. If I unroll the loop by 2x, port pressure is much lower: less than 0.7 uops to all of p0156, yet performance is still unexpectedly slow at 1.3 cycles per iteration.
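
(For concreteness, a 2x unroll at the source level could look like the sketch below. This is only an illustration of the idea — the measured unrolled asm wasn't necessarily produced this way — but it shows why per-element port pressure drops: one loop-counter update and one cmp/jne are shared by two element updates.)

#include <cstdint>
#include <cstddef>

// Sketch of a 2x manual unroll: two element updates per loop iteration.
void prefix_sum_unroll2(const uint32_t* input, uint32_t* output, size_t size) {
    uint32_t total = 0;
    size_t i = 0;
    for (; i + 1 < size; i += 2) {
        total += input[i];
        output[i] = total;
        total += input[i + 1];
        output[i + 1] = total;
    }
    for (; i < size; i++) {      // tail for odd sizes
        total += input[i];
        output[i] = total;
    }
}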

There is one store per iteration, but we can do one store per cycle.

There is one load per iteration, but we can do two of those per cycle.

There are two complex address generations per iteration (the indexed load and the indexed store), but we can do two of those per cycle.

What's the bottleneck here?

Interestingly, I tried the Ithemal performance predictor and it gets it almost exactly right: it estimates 1.314 cycles versus my measurement of 1.32.


1 I confirmed macro- and micro-fusion via the uops_issued.any counter, which counts in the fused domain and reads 4.0 fused uops per iteration for this loop.

asked Aug 12 '19 by BeeOnRope


1 Answer

I just played around with the instructions in the Ithemal performance predictor and I might have found the issue. Trying out

add     ecx, DWORD PTR [rdi]
mov     DWORD PTR [rsi], ecx
add     rax, 1
cmp     rdx, rax

gives a stunning 1.131 cycles per iteration. Cross-checking with adding 0 in each iteration (which again gives 1.3 cycles) eliminates the possibility of a store/load bottleneck, which finally suggests an issue with the addressing modes.

(editor's note: this is interesting experimental data, matching what I posted in the thread on Agner Fog's blog which the guess below misinterprets. Simpler addressing modes speed it up even though there's no unlamination.)
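
(A source-level way to experiment with this — my sketch, not part of the original answer — is a pointer-bumping loop that nudges the compiler toward the simple [reg] addressing modes used in the experiment above. Whether a given gcc version actually emits them still has to be checked in the generated asm.)

#include <cstdint>
#include <cstddef>

// Pointer-bumping variant: the load and store go through the pointers
// directly, which a compiler may lower to simple [reg] addressing plus two
// pointer increments instead of the indexed [base + index*4] forms above.
void prefix_sum_ptr(const uint32_t* input, uint32_t* output, size_t size) {
    uint32_t total = 0;
    const uint32_t* end = input + size;
    while (input != end) {
        total += *input++;    // simple-addressed load, then bump the pointer
        *output++ = total;    // simple-addressed store, then bump the pointer
    }
}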


(editor's note: this part is wrong: we know from the question there's no un-lamination because uops_issued.any = 4 per iteration.)

I think your CPU un-laminates your add/mov in the case of indexed addressing. This behavior is well documented for several architectures (SnB, HSW, SKL), and someone did a great job on Stack Overflow describing the whole thing: https://stackoverflow.com/a/31027695/1925289 In short: if too many registers and flags are involved, a uop that is fused in the DSB (uop cache) gets un-laminated on its way into the IDQ, thus becoming effectively un-fused again.

Other resources:

  • On fusing limits: https://www.agner.org/optimize/blog/read.php?i=415#852

  • On un-lamination: https://easyperf.net/blog/2018/02/15/MicroFusion-in-Intel-CPUs#unlamination-example-1

answered Oct 21 '22 by Zacharias