Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

AVX-512 and Branching

I'm confused as to what masking can do in theory in relation to branches. Let's say I have a Skylake-SP (ha, I wish..), and we're ignoring compiler capabilities, just what's possible in theory:

If a branch conditional is dependant on a static flag, and all branches set an array to a computational result, assuming the compiler does not optimize this to two separate loops anyways, can it vectorize?

do i = 1, nx
  if (my_flag .eq. 0) then
    a(i) = b(i) ** 2
  else
    a(i) = b(i) ** 3
  end if
end do

If only as subset of the branches are setting the value in question, can it vectorize?

do i = 1, nx
  if (my_flag .eq. 0) then
    a(i) = b(i) ** 2
  end if
end do

If a branch conditional is in itself dependent on vector data, can it vectorize?

do i = 1, nx
  if (c(i) > 0) then
    a(i) = b(i) ** 2
  else
    a(i) = b(i) ** 3
  end if
end do
like image 545
Michel Müller Avatar asked Nov 25 '17 01:11

Michel Müller


1 Answers

Yes, an efficient asm implementation is possible with any of SSE2 / SSE4.1 (for blendps) / AVX / AVX-512, for all of your loops, and compilers do auto-vectorize in practice, but gcc7.2 / clang5.0 / ICC18 all have missed optimizations.

According to static analysis for Skylake-AVX512 (see below), an efficient unrolled implementation of your final loop can run at one 64 byte vector of results per 1.25 clock cycles (plus loop overhead depending on how much you unroll). In practice, 1.33 or 1.5 clock cycles per vector is probably achievable, if your data is hot in L1D cache. Otherwise you easily bottleneck on L2 bandwidth, because you load 2x 64B per store vector 64B store.

For a C version of your loop, gcc, clang, and ICC all auto-vectorize more or less like I did by hand: See source + asm on the Godbolt compiler explorer.

I had to use -ffast-math with gcc for it to auto-vectorize. IDK why it doesn't realize it can safely auto-vectorize without breaking strict FP rules.

Clang seems to be evaluating tmp*tmp and tmp*tmp*tmp separately, and blending those two results instead of conditionally doing the 2nd multiply.

gcc does both multiplies and uses a separate movaps to merge the other way because it doesn't figure out how to invert the condition.

ICC uses KNOTW to invert the condition but then does the 2nd multiply with merge-masking exactly like I do.

Changing the code to do the extra multiply (**3 instead of **2) in the if branch instead of the else branch made all 3 compilers generate better code without each of their missed-optimizations from branching the other way. (There are still missed optimizations for gcc, but ICC and clang are looking solid, both essentially doing the same thing my hand-written code does.)

ICC chooses to only auto-vectorize this with 256b vectors. Maybe it does that by default to avoid lowering the max turbo clock speed? Maybe there's an option to use full-width vectors? gcc 8.0 snapshot also does that, but gcc7.2 uses ZMM vectors.


AVX-512 mask registers and merge-masking makes it even more efficient, but doing both ways and then blending has been a thing with SIMD (or even non-SIMD branchless code) for a long time. e.g. to conditionally add based on a vector compare result, use that vector compare result as an AND mask to leave some elements untouched, and make other elements zero.

0 is the additive identity: x + 0 = x. So x + (y&mask) is a no-op if the mask is all-zero, or it's x+y if the mask is all-one. See How to use if condition in intrinsics. (Fun trick: use a packed-compare result as an integer -1 or 0, so you can count matches but subtracting the compare-mask).

It's less simple for multiply because 1 is the multiplicative identity, but you can solve that by blending.

assuming the compiler does not optimize this to two separate loops anyways, can it vectorize?

In that first case, you should be unhappy with your compiler if it doesn't hoist the condition out of the loop and make two loops. Especially in the 2nd case, where it only needs one loop, because if the condition is false the array isn't modified.


Let's just talk about the 3rd case, because it's only one where the compiler shouldn't just hoist the condition. (And if your compiler is feeling dumb, it can use this version with a loop-invariant mask of all-zero or all-one for the other versions).

if (c(i) > 0)

So we need to load a vector of elements from c and compare against zero. AVX512 can do this for a vector of 16 single-precision float with one instruction with a mask register destination and a memory source operand.

; with zmm0 = 0.0 in all elements, from vxorps xmm0,xmm0,xmm0 outside the loop.
vcmpps    k1, zmm0, [rdx],  _CMP_NLT_UQ     ; !(0 < c(i))

I know (from writing the next part already) that I'm going to want k1 to be true for elements where the c(i) > 0 condition is false. Only the 2nd vector operand can be memory instead of a register, so I had to reverse it and use not-less-than instead of not-greater-than. (And I can't just use >= instead of <, because that would put the unordered case (one or both NaN) in the wrong category. FP compares have 4 possible results: above/below/equal/unordered, so you have to pick a predicate that does what you want (i.e. what the source says, if you're a compiler) for all 4 cases. If you compile with -ffast-math, the compiler is allowed to ignore the possibility of NaN.

If you need to chain two conditions together, AVX512 compare-into-mask instructions can mask the operation of writing into the mask, with zero-masking or merge-masking.

vcmpltps    k1,        zmm1, zmm2       ; k1 = zmm1<zmm2
vcmpltps    k2{k1}{z}, zmm3, zmm4       ; k2 = (zmm3<zmm4) & (zmm1<zmm2)

k2 is 0 everywhere that that zmm3k1 was zero, because we used k1 as a zero-mask.


  if (c(i) > 0) then
    a(i) = b(i) ** 2
  else
    a(i) = b(i) ** 3
  end if

The common subexpression here is b(i) * b(i). We can get b(i)**3 from that by multiplying by b(i) one extra time.

vmovups    zmm1, [rsi]       ; load a vector from b(i)
vmulps     zmm2, zmm1, zmm1  ; zmm2 = zmm1*zmm1 = b(i)**2

AVX-512 can merge based on a mask as part of (almost) any other instruction.

vmulps     zmm2{k1}, zmm2, zmm1  ; zmm2 *= zmm1   for elements where k1 is true

vmovups    [rdi], zmm2           ; store all 16 elements into a(i)

BTW, AVX512 has merge-masking for stores. Previous SIMD instruction sets would load from [rdi], blend, then store back into [rdi]. This means you can implement your 2nd loop (sometimes leave a(i) unmodified) with a per-element condition more efficiently than with AVX1/ AVX2.


Putting this all together: (NASM syntax)

 ; x86-64 System V calling convention
 ; args: rdi = a() output array.
 ;       rsi = b() input array
 ;       rdx = c() array to be tested for positive numbers
 ;       rcx = count (in elements)
 ; preferably all 64-byte aligned, but will work slowly if some aren't
 ; rcx must be >= 16, and a multiple of 16, because I didn't write any cleanup code

global square_or_cube
square_or_cube: 

    vxorps     xmm0,  xmm0,xmm0

 .loop:                          ; do {
    vcmpps     k1, zmm0, [rdx], 21    ; _CMP_NLT_UQ  ; !(0 < c(i))

    vmovups    zmm1, [rsi]            ; load a vector from b(i)
    vmulps     zmm2,     zmm1, zmm1   ; zmm2 = zmm1*zmm1 = b(i)**2

    vmulps     zmm2{k1}, zmm2, zmm1   ; zmm2 *= zmm1   for elements where k1 is true, otherwise unmodified.
    vmovups    [rdi], zmm2            ; store all 16 elements into a(i)

    ; TODO: unroll some and/or use indexed addressing mode tricks to save instructions
    add         rdi, 64      ; pointer increments
    add         rsi, 64
    add         rdx, 64

    sub         rcx, 16         ;  count -= 16 
    ja        .loop             ; } while(count>0);

I analyzed this with IACA (omitting the pointer-increment instructions to simulate unrolling and more clever asm tricks). According to IACA, even the merge-masking vmulps is a single uop, and the memory-source instructions micro-fuses to a single uop for the front-end. (So does the store.) This is what I was hoping, and IACA's output looks correct for this case, although I don't have access to performance counters on SKL-SP hardware to check that.

$ iaca.sh -arch SKX avx512-conditional
Intel(R) Architecture Code Analyzer Version - 2.3 build:246dfea (Thu, 6 Jul 2017 13:38:05 +0300)
Analyzed File - avx512-conditional
Binary Format - 64Bit
Architecture  - SKX
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.50 Cycles       Throughput Bottleneck: FrontEnd

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 1.5    0.0  | 0.0  | 1.0    1.0  | 1.0    1.0  | 1.0  | 1.5  | 1.0  | 1.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   2^   |           |     | 1.0   1.0 |           |     | 1.0 |     |     | CP | vcmpps k1, zmm0, zmmword ptr [rdx], 0x15
|   1    |           |     |           | 1.0   1.0 |     |     |     |     |    | vmovups zmm1, zmmword ptr [rsi]
|   1    | 1.0       |     |           |           |     |     |     |     | CP | vmulps zmm2, zmm1, zmm1
|   1    | 0.5       |     |           |           |     | 0.5 |     |     | CP | vmulps zmm2{k1}, zmm2, zmm1
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | vmovups zmmword ptr [rdi], zmm2
|   1    |           |     |           |           |     |     | 1.0 |     |    | sub rcx, 0x10
|   0F   |           |     |           |           |     |     |     |     |    | jnbe 0xffffffffffffffdd
Total Num Of Uops: 8

AVX-512 actually has vfpclassps (C/C++ intrinsic [_mm512_fpclass_ps_mask]4, asm documentation with a table in the related vfpclasspd (packed double)) to classify FP values according to your choice of predicates. It may be slightly more efficient than using a full comparison against another register which happens to be zero.
(Actually, according to IACA, it isn't. Both are listed as 3 cycle latency by the InstLatx64 spreadsheet. Agner Fog's measurement for AVX2 cmpps on Skylake-S (non-AVX512 desktop chips) shows 4 cycles, so it's strange that the AVX512 version is lower latency when producing a mask-register result instead of a vector.

I want the result to be false only for positive numbers, and I think vfpclassps can do that by setting almost all the predicate bits to get -Inf, finite negative, quiet and signalling NaN, -0.0, and +0.0.

vfpclassps    k1, [rdx], 0x1 | 0x2 | 0x4 | 0x10 | 0x40 | 0x80     ; QNaN | -0.0 | +0.0 | -Infinity | Negative (finite) | SNaN
; k1 = a 16-bit bitmap of which elements (from memory at [rdx]) need an extra multiply

vpfclassps is interesting because it lets you differentiate between +0.0 and -0.0, like you could by checking the sign bit in the binary representation (like you could with AVX2 vblendps to use the sign bit as a blend-control, without doing a comparison first).

Also, in this case, it saves one instruction outside the loop setting up a register of all-zeros.


related: AVX512 has instructions to multiply by 2**floor(x) (vscalefpd), but not to raise a number to an arbitrary power (integer or otherwise). Xeon Phi has AVX512ER, which gives you fast approximations for 2**x (without flooring x), but we can't directly use an exponential function here either, and SKL-SP doesn't have AVX512ER anyway.


NASM macros for IACA_start / end:

I wrote these based on the iaca_marks.h C/C++ header.

%if 1
%macro  IACA_start 0
     mov ebx, 111
     db 0x64, 0x67, 0x90
%endmacro
%macro  IACA_end 0
     mov ebx, 222
     db 0x64, 0x67, 0x90
%endmacro
%else
%define IACA_start
%define IACA_end
%endif

Wrap them around any code you want to analyze.


Conditional branch on a loop-invariant condition inside the loop

A compiler could branch inside the loop. IDK if any would make code like this, but they certainly could.

; rdi = destination
; rsi = source
; edx = condition
; rcx = element count
global square_or_cube
square_or_cube: 

 .loop:                          ; do {
    vmovups    zmm1, [rsi]            ; load a vector from b(i)
    vmulps     zmm2, zmm1, zmm1   ; zmm2 = zmm1*zmm1 = b(i)**2

    test       edx,edx
    jz        .only_square        ; test-and-branch to conditionally skip the 2nd multiply
    vmulps     zmm2, zmm2, zmm1   ; zmm2 *= zmm1
   .only_square:

    vmovups    [rdi], zmm2        ; store all 16 elements into a(i)

    add         rdi, 64      ; pointer increments
    add         rsi, 64

    sub         rcx, 16         ;  count -= 16 
    ja        .loop             ; } while(count>0);
like image 168
Peter Cordes Avatar answered Oct 27 '22 14:10

Peter Cordes