Fully utilizing pipelines on kaby lake

Tags:

(Followup code-review question here, with more details of the context of this loop.)

Environment:

Windows 7 x64
VS 2017 community
Targeting x64 code on Intel i7700k (kaby lake)

I don't write a lot of assembler code, and when I do, it's either short enough or simple enough that I don't have to worry much about squeezing the maximum amount of perf out of it. My more complex code is usually written in C and I let the compiler's optimizers worry about latency, code alignment, etc.

However in my current project, MSVC's optimizer does a remarkably poor job on the code in my critical path. So...

I haven't yet found a good tool that does either a static or runtime analysis of x64 assembler code with a view to removing stalls, improving latency, etc. All I've got is the VS profiler which tells me (roughly) which instructions are taking the most time. And the clock on the wall that tells me if the latest change has made things better or worse.

As an alternative, I've been slogging my way thru Agner's docs with the hope of squeezing some more perf out of my code. The problem is that it's hard to understand any of his work until you understand all of it. But pieces of it make sense, and I'm trying to apply what I have learned.

What that in mind, here's the core of my innermost loop which (not surprisingly) is where the VS profiler says my time is being spent:

Click to copy

nottop:

vpminub ymm2, ymm2, ymm3 ; reset out of range values
vpsubb  ymm2, ymm2, ymm0 ; take a step

top:
vptest  ymm2, ymm1       ; check for out of range values
jnz nottop

; Outer loop that does some math, does a "vpsubb ymm2, ymm2, ymm0",
; and eventually jumps back to top

Yup, this is pretty much a textbook example of a dependency chain: Each instruction in this tight little loop depends upon the results of the previous operation. This means there can be no parallelism, which means I'm not taking full advantage of the processor.

Inspired by Agner's "optimizing assembler" doc, I've come up with an approach that (hopefully) allows me to do 2 operations at a time, so I could have one pipeline updating ymm2 and another updating (say) ymm8.

It's a non-trivial change though, so before I start ripping everything apart, I wonder if it's likely to help. Looking at Agner's "Instruction tables" for kaby lake (my target), I see that:

Click to copy

        uops
        each
        port    Latency
pminub  p01     1
psubb   p015    1
ptest   p0 p5   3

Given this, it looks like while one pipeline is using p0+p5 to do the vptest against ymm2, another can be utilizing p1 to do both vpminub and vpsubb on ymm8. Yeah, things are still going to get stacked up behind vptest, but it should help.

Or would it?

I'm currently running this code from 8 threads (Yes, 8 threads really does give me better total throughput than 4,5,6 or 7). Given that my i7700k has 4 hyperthreaded cores, wouldn't the fact that there are 2 threads running on each core mean that I'm already maxing out the ports? Ports are "per core," not "per logical cpu," right?

So.

Based on my current understanding of Agner's work, it appears that there is no way to further optimize this code in its current form. If I want better perf, I'm going to need to come up with a different approach.

And yes, I'm sure if I posted my whole asm routine here, someone could suggest an alternate approach. But the purpose of this Question isn't to have someone write my code for me. I'm trying to see if I'm starting to understand how to think about optimizing asm code.

Is this (roughly) the right way to look at things? Am I missing a few pieces? Or is this flat-out wrong?

995

asked Jul 16 '17 05:07

David Wohlferd

2 Answers

TL:DR: I think Hyperthreading should keep all your vector ALU ports busy with 2 threads per core.

vptest doesn't write either vector register, only flags. The next iteration doesn't have to wait for it, so its latency is mostly irrelevant.

Only jnz is dependent on vptest, and speculative execution + branch prediction hides the latency of control dependencies. vptest latency is relevant for how quickly a branch mispredict can be detected, but not for throughput in the correctly-predicted case.

Good point about hyperthreading. Interleaving two independent dep chains within a single thread can be helpful, but it's a lot harder to do correctly and efficiently.

Let's look at the instructions in your loop. predicted-taken jnz will always run on p6, so we can discount it. (Unrolling could actually hurt: predicted-not-taken jnz can also run on p0 or p6)

On a core by itself, your loop should run at 2 cycles per iteration, bottlenecked on latency. It's 5 fused-domain uops, so it takes 1.25 cycles to issue. (Unlike test, jnz can't macro-fuse with vptest). With hyperthreading, the front-end is already a worse bottleneck than latency. Each thread can issue 4 uops every other cycle, which is less than the 5 uops every other cycle of the dependency-chain bottleneck.

(This is common for recent Intel, especially SKL/KBL: many uops have enough ports to choose from that sustaining 4 uops per clock throughput is realistic, especially with SKL's improved throughput of the uop-cache and decoders to avoid issue bubbles due to front-end limitations rather than the back-end filling up.)

Every time one thread stalls (e.g. for a branch mispredict), the front-end can catch up on the other thread and get lots of future iterations into the out-of-order core for it to chew through at one iter per 2 cycles. (Or less because of execution-port throughput limits, see below).

Execution-port throughput (unfused-domain):

Only 1 of every 5 uops runs on p6 (the jnz). It can't be a bottleneck because the front-end issue rate limits us to less than one branch issuing per clock while running this loop.

The other 4 vector ALU uops per iteration have to run on the 3 ports with vector execution units. The p01 and p015 uops have enough scheduling flexibility that no single port will be a bottleneck, so we can just look at total ALU throughput. That's 4 uops / iter for 3 ports, for a max average throughput for a physical core of one iter per 1.333 cycles.

For a single thread (no HT), this is not the most serious bottleneck. But with two hyperthreads, that's one iter per 2.6666 cycles.

Hyperthreading should saturate your execution units, with some front-end throughput to spare. Each thread should average one per 2.666c, with the front-end able to issue at one per 2.5c. Since latency only limits you to one per 2c, it can catch up after any delays on the critical-path due to resource conflicts. (a vptest uop stealing a cycle from one of the other two uops).

If you can change the loop to check any less often, or with fewer vector uops, that could be a win. But everything I'm thinking of is more vector uops (e.g. vpand instead of vptest and then vpor a couple of those results together before checking... Or vpxor to produce an all-zero vector when vptest would). Maybe if there was a vector XNOR or something, but there isn't.

To check what's actually happening, you could use perf counters to profile your current code, and see what uop throughput you're getting for a whole core (not just each logical thread separately). Or profile one logical thread and see if it's saturating about half of p015.

128

answered Nov 14 '22 10:11

Peter Cordes

A partial answer:

Intel provides a tool named Intel Architecture Code Analyzer (described here) that does static analysis of code, showing (kind of) what ports are in use in a section of asm code.

Unfortunately:

v2.3 doesn't include the essential (and only) header file. You can find this file in v2.2.
v2.2 includes the header, but omits a python script (pt.py) for analyzing the output. This file is also not included in v2.3 (haven't found it yet).
One of the output formats from iaca is a .dot file, that is read by graphviz, but the Intel docs fail to describe which of the 38 executables in graphviz is used to display output.

But perhaps most importantly (for my needs):

v2.3 (currently the most recent release) supports Skylake, but not Kaby Lake.

Given how the implementation details change between processors, that makes all the output suspect. Dates in the pdf file suggest that v2.3 was released July 2017, meaning that I'm probably going to have to wait a while for the next release.

answered Nov 14 '22 12:11

David Wohlferd

Related questions
                            
                                Insertion in the middle of ArrayList vs LinkedList [duplicate]
                            
                                ASP.NET Web API performance issue
                            
                                MySQL queries are very slow even using INNER JOIN instead of IN
                            
                                Check the performance of Chrome Extension
                            
                                Equivalient method overload why necessary?
                            
                                what is ScriptResource.axd in the page load in web app?
                            
                                Speed differences between proc, Proc.new, lambda, and stabby lambda
                            
                                Why is checking for an attribute using dot notation before removing faster than removing the attribute outright?
                            
                                unique rows in dplyr : row_number() from tbl_dt inconsistent to tbl_df
                            
                                Using python efficiently to calculate hamming distances [closed]
                            
                                Should I Provide Data In Routing Logic or Meteor Template Helper?
                            
                                Boost pool allocator slower than new
                            
                                numpy performance differences between Linux and Windows
                            
                                Android WebSocket Services making multiple connections
                            
                                Optimizing jQuery selector / addBack() when dealing with a large collection
                            
                                in_array() performance optimization
                            
                                Summing data from array based on other array in Numpy
                            
                                Why does inlining Math.max give over 200x slower code?
                            
                                How to speed up heavy conditional formatting rules
                            
                                Which type of cache is suitable for using in Umbraco project and how can I implement an intelligent cache?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Fully utilizing pipelines on kaby lake

Tags:

performance

assembly

x86-64

micro-optimization

avx2

David Wohlferd

People also ask

2 Answers

Peter Cordes

David Wohlferd

Recent Activity

Donate For Us