Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fully utilizing pipelines on kaby lake

(Followup code-review question here, with more details of the context of this loop.)


Environment:

  • Windows 7 x64
  • VS 2017 community
  • Targeting x64 code on Intel i7700k (kaby lake)

I don't write a lot of assembler code, and when I do, it's either short enough or simple enough that I don't have to worry much about squeezing the maximum amount of perf out of it. My more complex code is usually written in C and I let the compiler's optimizers worry about latency, code alignment, etc.

However in my current project, MSVC's optimizer does a remarkably poor job on the code in my critical path. So...

I haven't yet found a good tool that does either a static or runtime analysis of x64 assembler code with a view to removing stalls, improving latency, etc. All I've got is the VS profiler which tells me (roughly) which instructions are taking the most time. And the clock on the wall that tells me if the latest change has made things better or worse.

As an alternative, I've been slogging my way thru Agner's docs with the hope of squeezing some more perf out of my code. The problem is that it's hard to understand any of his work until you understand all of it. But pieces of it make sense, and I'm trying to apply what I have learned.

What that in mind, here's the core of my innermost loop which (not surprisingly) is where the VS profiler says my time is being spent:

nottop:

vpminub ymm2, ymm2, ymm3 ; reset out of range values
vpsubb  ymm2, ymm2, ymm0 ; take a step

top:
vptest  ymm2, ymm1       ; check for out of range values
jnz nottop

; Outer loop that does some math, does a "vpsubb ymm2, ymm2, ymm0",
; and eventually jumps back to top

Yup, this is pretty much a textbook example of a dependency chain: Each instruction in this tight little loop depends upon the results of the previous operation. This means there can be no parallelism, which means I'm not taking full advantage of the processor.

Inspired by Agner's "optimizing assembler" doc, I've come up with an approach that (hopefully) allows me to do 2 operations at a time, so I could have one pipeline updating ymm2 and another updating (say) ymm8.

It's a non-trivial change though, so before I start ripping everything apart, I wonder if it's likely to help. Looking at Agner's "Instruction tables" for kaby lake (my target), I see that:

        uops
        each
        port    Latency
pminub  p01     1
psubb   p015    1
ptest   p0 p5   3

Given this, it looks like while one pipeline is using p0+p5 to do the vptest against ymm2, another can be utilizing p1 to do both vpminub and vpsubb on ymm8. Yeah, things are still going to get stacked up behind vptest, but it should help.

Or would it?

I'm currently running this code from 8 threads (Yes, 8 threads really does give me better total throughput than 4,5,6 or 7). Given that my i7700k has 4 hyperthreaded cores, wouldn't the fact that there are 2 threads running on each core mean that I'm already maxing out the ports? Ports are "per core," not "per logical cpu," right?

So.

Based on my current understanding of Agner's work, it appears that there is no way to further optimize this code in its current form. If I want better perf, I'm going to need to come up with a different approach.

And yes, I'm sure if I posted my whole asm routine here, someone could suggest an alternate approach. But the purpose of this Question isn't to have someone write my code for me. I'm trying to see if I'm starting to understand how to think about optimizing asm code.

Is this (roughly) the right way to look at things? Am I missing a few pieces? Or is this flat-out wrong?

like image 995
David Wohlferd Avatar asked Jul 16 '17 05:07

David Wohlferd


People also ask

What is Kaby Lake architecture?

Kaby Lake features a new graphics architecture to improve performance in 3D graphics and 4K video playback. It adds native HDCP 2.2 support, along with fixed function decode of H. 264 (AVC), HEVC Main and Main10/10-bit, and VP9 10-bit and 8-bit video.

How old is Kaby Lake?

The 7th-generation microarchitecture in Intel's Core family of CPU chips. Introduced in 2016, Kaby Lake superseded Skylake using the same 14 nm process technology. Kaby Lake offered faster clock speeds with a maximum Turbo rate of 3.6 GHz. Graphics and 4K video performance are also improved.

What is Kaby Lake refresh?

Kaby Lake R (KBL-R) is the name of the core for Intel's line of low-power mobile processors based on the Kaby Lake microarchitecture serving as a refresh to Kaby Lake U.

Are Kaby Lake and Skylake compatible?

NOTE: Kaby Lake motherboards also support Skylake CPUs.


2 Answers

TL:DR: I think Hyperthreading should keep all your vector ALU ports busy with 2 threads per core.


vptest doesn't write either vector register, only flags. The next iteration doesn't have to wait for it, so its latency is mostly irrelevant.

Only jnz is dependent on vptest, and speculative execution + branch prediction hides the latency of control dependencies. vptest latency is relevant for how quickly a branch mispredict can be detected, but not for throughput in the correctly-predicted case.


Good point about hyperthreading. Interleaving two independent dep chains within a single thread can be helpful, but it's a lot harder to do correctly and efficiently.

Let's look at the instructions in your loop. predicted-taken jnz will always run on p6, so we can discount it. (Unrolling could actually hurt: predicted-not-taken jnz can also run on p0 or p6)

On a core by itself, your loop should run at 2 cycles per iteration, bottlenecked on latency. It's 5 fused-domain uops, so it takes 1.25 cycles to issue. (Unlike test, jnz can't macro-fuse with vptest). With hyperthreading, the front-end is already a worse bottleneck than latency. Each thread can issue 4 uops every other cycle, which is less than the 5 uops every other cycle of the dependency-chain bottleneck.

(This is common for recent Intel, especially SKL/KBL: many uops have enough ports to choose from that sustaining 4 uops per clock throughput is realistic, especially with SKL's improved throughput of the uop-cache and decoders to avoid issue bubbles due to front-end limitations rather than the back-end filling up.)

Every time one thread stalls (e.g. for a branch mispredict), the front-end can catch up on the other thread and get lots of future iterations into the out-of-order core for it to chew through at one iter per 2 cycles. (Or less because of execution-port throughput limits, see below).


Execution-port throughput (unfused-domain):

Only 1 of every 5 uops runs on p6 (the jnz). It can't be a bottleneck because the front-end issue rate limits us to less than one branch issuing per clock while running this loop.

The other 4 vector ALU uops per iteration have to run on the 3 ports with vector execution units. The p01 and p015 uops have enough scheduling flexibility that no single port will be a bottleneck, so we can just look at total ALU throughput. That's 4 uops / iter for 3 ports, for a max average throughput for a physical core of one iter per 1.333 cycles.

For a single thread (no HT), this is not the most serious bottleneck. But with two hyperthreads, that's one iter per 2.6666 cycles.

Hyperthreading should saturate your execution units, with some front-end throughput to spare. Each thread should average one per 2.666c, with the front-end able to issue at one per 2.5c. Since latency only limits you to one per 2c, it can catch up after any delays on the critical-path due to resource conflicts. (a vptest uop stealing a cycle from one of the other two uops).

If you can change the loop to check any less often, or with fewer vector uops, that could be a win. But everything I'm thinking of is more vector uops (e.g. vpand instead of vptest and then vpor a couple of those results together before checking... Or vpxor to produce an all-zero vector when vptest would). Maybe if there was a vector XNOR or something, but there isn't.


To check what's actually happening, you could use perf counters to profile your current code, and see what uop throughput you're getting for a whole core (not just each logical thread separately). Or profile one logical thread and see if it's saturating about half of p015.

like image 128
Peter Cordes Avatar answered Nov 14 '22 10:11

Peter Cordes


A partial answer:

Intel provides a tool named Intel Architecture Code Analyzer (described here) that does static analysis of code, showing (kind of) what ports are in use in a section of asm code.

Unfortunately:

  • v2.3 doesn't include the essential (and only) header file. You can find this file in v2.2.
  • v2.2 includes the header, but omits a python script (pt.py) for analyzing the output. This file is also not included in v2.3 (haven't found it yet).
  • One of the output formats from iaca is a .dot file, that is read by graphviz, but the Intel docs fail to describe which of the 38 executables in graphviz is used to display output.

But perhaps most importantly (for my needs):

  • v2.3 (currently the most recent release) supports Skylake, but not Kaby Lake.

Given how the implementation details change between processors, that makes all the output suspect. Dates in the pdf file suggest that v2.3 was released July 2017, meaning that I'm probably going to have to wait a while for the next release.

like image 34
David Wohlferd Avatar answered Nov 14 '22 12:11

David Wohlferd