 

BMI for generating masks with AVX512

Tags: x86, simd, avx512, bmi

I was inspired by this link https://www.sigarch.org/simd-instructions-considered-harmful/ to look into how AVX512 performs. My idea was that the cleanup loop after the main loop could be removed using the AVX512 mask operations.

Here is the code I am using

#include <immintrin.h>

void daxpy2(int n, double a, const double x[], double y[]) {
  __m512d av = _mm512_set1_pd(a);
  int r = n&7, n2 = n - r;                  // r = leftover elements, n2 = largest multiple of 8 <= n
  for(int i=-n2; i<0; i+=8) {               // main loop over full 8-element blocks
    __m512d yv = _mm512_loadu_pd(&y[i+n2]);
    __m512d xv = _mm512_loadu_pd(&x[i+n2]);
    yv = _mm512_fmadd_pd(av, xv, yv);
    _mm512_storeu_pd(&y[i+n2], yv);
  }
  // remainder: full-width load and FMA, then a masked store of only the last r elements
  __m512d yv = _mm512_loadu_pd(&y[n2]);
  __m512d xv = _mm512_loadu_pd(&x[n2]);
  yv = _mm512_fmadd_pd(av, xv, yv);
  __mmask8 mask = (1 << r) -1;
  //__mmask8 mask = _bextr_u32(-1, 0, r);
  _mm512_mask_storeu_pd(&y[n2], mask, yv);
}

I thought using BMI1 and/or BMI2 instructions could generate masks with fewer instructions. However,

__mmask8 mask = _bextr_u32(-1, 0, r);

is no better (in number of instructions) than

__mmask8 mask = (1 << r) -1;

see https://godbolt.org/z/BFQCM3 and https://godbolt.org/z/tesmB_.

This appears to be because bextr takes its start position and length packed into a single control register (the length in bits [15:8]), so _bextr_u32 has to do a shift by 8 anyway.
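
For reference, a quick scalar check that the two expressions produce identical masks for r = 0..7 (my own snippet, not from the linked article; it assumes a compile with BMI1 enabled, e.g. -mbmi):

#include <stdio.h>
#include <immintrin.h>   // _bextr_u32

int main(void) {
  for (unsigned r = 0; r < 8; ++r) {
    unsigned shift_mask = (1u << r) - 1;          // shift/decrement form
    unsigned bextr_mask = _bextr_u32(-1u, 0, r);  // extract r bits starting at bit 0; the compiler
                                                  // packs (len << 8) | start into a control register
    printf("r=%u  shift=0x%02x  bextr=0x%02x\n", r, shift_mask, bextr_mask);
  }
  return 0;
}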

Can the mask be generated with fewer instructions (e.g. with BMI or another method) or more optimally?


I have augmented the table in the link with my results for AVX512.

ISA                           | MIPS-32 | AVX2   | RV32V | AVX512 |
------------------------------|---------|--------|-------|--------|
Instructions (static)         |      22 |     29 |    13 |     28 |
Instructions per Main Loop    |       7 |     6* |    10 |     5* |
Bookkeeping Instructions      |      15 |     23 |     3 |     23 |
Results per Main Loop         |       2 |      4 |    64 |      8 |
Instructions (dynamic n=1000) |    3511 | 1517** |   163 |    645 |

*macro-op fusion will reduce the number of uops in the main loop by 1
** without the unnecessary cmp instructions it would only be 1250+ instructions.

I think if the authors of the link had counted from -n up to 0 instead of from 0 to n, they could have skipped the cmp instruction in the main loop as I have (see the assembly below), so for AVX it could have been 5 instructions in the main loop.

Here is the assembly with ICC19 and -O3 -xCOMMON-AVX512

daxpy2(int, double, double const*, double*):
    mov       eax, edi                                      #6.13
    and       eax, 7                                        #6.13
    movsxd    r9, edi                                       #6.25
    sub       r9, rax                                       #6.21
    mov       ecx, r9d                                      #7.14
    neg       ecx                                           #7.14
    movsxd    rcx, ecx                                      #7.14
    vbroadcastsd zmm16, xmm0                                #5.16
    lea       rdi, QWORD PTR [rsi+r9*8]                     #9.35
    lea       r8, QWORD PTR [rdx+r9*8]                      #8.35
    test      rcx, rcx                                      #7.20
    jge       ..B1.5        # Prob 36%                      #7.20
..B1.3:                         # Preds ..B1.1 ..B1.3
    vmovups   zmm17, ZMMWORD PTR [rdi+rcx*8]                #10.10
    vfmadd213pd zmm17, zmm16, ZMMWORD PTR [r8+rcx*8]        #10.10
    vmovups   ZMMWORD PTR [r8+rcx*8], zmm17                 #11.23
    add       rcx, 8                                        #7.23
    js        ..B1.3        # Prob 82%                      #7.20
..B1.5:                         # Preds ..B1.3 ..B1.1
    vmovups   zmm17, ZMMWORD PTR [rsi+r9*8]                 #15.8
    vfmadd213pd zmm16, zmm17, ZMMWORD PTR [rdx+r9*8]        #15.8
    mov       edx, -1                                       #17.19
    shl       eax, 8                                        #17.19
    bextr     eax, edx, eax                                 #17.19
    kmovw     k1, eax                                       #18.3
    vmovupd   ZMMWORD PTR [r8]{k1}, zmm16                   #18.3
    vzeroupper                                              #19.1
    ret                                                     #19.1

where

    add       rcx, 8
    js        ..B1.3

should macro-op fuse into one uop. However, as pointed out by Peter Cordes in this answer, js cannot fuse with add. The compiler could have generated jl instead, which would have fused.


I used Agner Fog's testp utility to get the core clocks (not reference clocks), instructions, and uops retired. I did this for SSE2 (actually AVX2 with FMA but with 128-bit vectors), AVX2, and AVX512, for three different variations of the loop:

v1 = for(int64_t i=0;   i<n;  i+=vec_size) // generates cmp instruction
v2 = for(int64_t i=-n2; i<0;  i+=vec_size) // no cmp but uses js
v3 = for(int64_t i=-n2; i!=0; i+=vec_size) // no cmp and uses jne

vec_size = 2 for SSE, 4 for AVX2, and 8 for AVX512
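
For concreteness, the AVX2 kernel with the v3 loop might look roughly like this (my reconstruction for illustration, not necessarily the exact code that was timed; compile with AVX2 and FMA enabled):

#include <immintrin.h>
#include <stdint.h>

void daxpy_avx2_v3(int64_t n, double a, const double x[], double y[]) {
  __m256d av = _mm256_set1_pd(a);
  int64_t r = n & 3, n2 = n - r;            // vec_size = 4 doubles per __m256d
  for (int64_t i = -n2; i != 0; i += 4) {   // v3 loop: no cmp, the branch is jne
    __m256d yv = _mm256_loadu_pd(&y[i + n2]);
    __m256d xv = _mm256_loadu_pd(&x[i + n2]);
    yv = _mm256_fmadd_pd(av, xv, yv);
    _mm256_storeu_pd(&y[i + n2], yv);
  }
  for (int64_t i = n2; i < n; ++i)          // scalar cleanup of the last r elements
    y[i] += a * x[i];
}

The measured results for the three variants: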

vec_size version   core cycles   instructions   uops
2        v1        895           3014           3524
2        v2        900           2518           3535
2        v3        870           2518           3035
4        v1        527           1513           1777
4        v2        520           1270           1777
4        v3        517           1270           1541
8        v1        285            765            910
8        v2        285            645            910
8        v3        285            645            790

Notice that the core-clock count is not really a function of the loop version; it depends only on the number of loop iterations and is proportional to 2*n/vec_size.

SSE     2*1000/2=1000
AVX2    2*1000/4=500
AVX512  2*1000/8=250

The number of instructions does change from v1 to v2, but not between v2 and v3: for v1 it is proportional to 6*n/vec_size, while for v2 and v3 it is proportional to 5*n/vec_size.

Finally, the number of uops is more or less the same for v1 and v2 but drops for v3: for v1 and v2 it is proportional to 7*n/vec_size, and for v3 to 6*n/vec_size.


Here is the result with IACA3 for vec_size=2

Throughput Analysis Report
--------------------------
Block Throughput: 1.49 Cycles       Throughput Bottleneck: FrontEnd
Loop Count:  50
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  0.5     0.0  |  0.5  |  1.5     1.0  |  1.5     1.0  |  1.0  |  0.0  |  0.0  |  0.0  |
--------------------------------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | vmovupd xmm1, xmmword ptr [r8+rax*8]
|   2      | 0.5         | 0.5  | 0.5     0.5 | 0.5     0.5 |      |      |      |      | vfmadd213pd xmm1, xmm2, xmmword ptr [rcx+rax*8]
|   2      |             |      | 0.5         | 0.5         | 1.0  |      |      |      | vmovups xmmword ptr [rcx+rax*8], xmm1
|   1*     |             |      |             |             |      |      |      |      | add rax, 0x2
|   0*F    |             |      |             |             |      |      |      |      | js 0xffffffffffffffe3
Total Num Of Uops: 6

IACA claims that js macro-fuses with add, which disagrees with Agner Fog's tables and with the performance counters from the testp utility. See above: the uop count for v2 is proportional to 7*n/vec_size while for v3 it is proportional to 6*n/vec_size, which I infer to mean that js does not macro-fuse.

I think the authors of the link, in addition to the number of instructions, should also have considered core cycles and maybe uops.

asked Feb 21 '19 by Z boson



2 Answers

You can save one instruction if you use the following BMI2 intrinsic:

  __mmask8 mask = _bzhi_u32(-1, r);

instead of __mmask8 mask = (1 << r) -1;. See Godbolt link.

The bzhi instruction zeros the high bits starting at a specified position. With register operands, bzhi has a latency of 1 cycle and a throughput of 2 per cycle.
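
As a minimal sketch of the difference (assuming a BMI2-enabled compile, e.g. -mbmi2 or -xCOMMON-AVX512; the exact sequences depend on the compiler):

#include <immintrin.h>   // _bzhi_u32

__mmask8 mask_shift(unsigned r) {
  return (1u << r) - 1;       // typically: mov 1, shlx, sub 1  (3 instructions)
}

__mmask8 mask_bzhi(unsigned r) {
  return _bzhi_u32(-1u, r);   // typically: mov -1, bzhi        (2 instructions)
}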

answered by wim


In addition to @wim's answer of using _bzhi_u32 instead of _bextr_u32, you should:

  • Mask the _mm512_loadu_pd instructions at the end, to avoid loading invalid memory (https://stackoverflow.com/a/54530225) or doing arithmetic on non-finite values.
  • Use 64-bit integers everywhere (either signed or unsigned) to avoid the movsxd sign-extension. This is generally good advice on 64-bit systems, unless you need to store a lot of index variables.
  • Use i!=0 instead of i<0 as the loop condition to get a jne instead of js, since it pairs better with the add instruction: https://stackoverflow.com/a/31778403
  • Some minor things: instead of n2 = n-r, you could also calculate n2 = n & (-8) or n2 = n ^ r. Not sure if that makes a relevant difference (ICC does not seem to know or care about it). Godbolt-Link

void daxpy2(size_t n, double a, const double x[], double y[]) {
  __m512d av = _mm512_set1_pd(a);
  size_t r = n&7, n2 = n & (-8);            // n2 = n rounded down to a multiple of 8
  for(size_t i=-n2; i!=0; i+=8) {           // i counts up from -n2 (mod 2^64) to 0: add + jne, no cmp
    __m512d yv = _mm512_loadu_pd(&y[i+n2]);
    __m512d xv = _mm512_loadu_pd(&x[i+n2]);
    yv = _mm512_fmadd_pd(av, xv, yv);
    _mm512_storeu_pd(&y[i+n2], yv);
  }
  __mmask8 mask = _bzhi_u32(-1, r);
  // masked loads: no memory beyond the end of x and y is touched
  __m512d yv = _mm512_mask_loadu_pd(_mm512_undefined_pd (), mask, &y[n2]);
  __m512d xv = _mm512_mask_loadu_pd(_mm512_undefined_pd (), mask, &x[n2]);
  yv = _mm512_mask_fmadd_pd(av, mask, xv, yv);
  _mm512_mask_storeu_pd(&y[n2], mask, yv);
}

To further reduce the number of instructions, you can use pointer incrementation, e.g., roughly like the sketch below (this increases the number of instructions inside the loop, however).
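
The example originally linked here is not reproduced in this copy; as a rough sketch of the idea (my guess at the intent, not chtz's exact code), the index arithmetic is traded for incrementing the pointers themselves:

#include <immintrin.h>
#include <stddef.h>

void daxpy2_ptr(size_t n, double a, const double x[], double y[]) {
  __m512d av = _mm512_set1_pd(a);
  size_t r = n & 7;
  const double* xend = x + (n - r);       // end of the full 8-element blocks
  for (; x != xend; x += 8, y += 8) {     // two pointer increments per iteration instead of one index add
    __m512d yv = _mm512_loadu_pd(y);
    __m512d xv = _mm512_loadu_pd(x);
    yv = _mm512_fmadd_pd(av, xv, yv);
    _mm512_storeu_pd(y, yv);
  }
  __mmask8 mask = _bzhi_u32(-1, r);
  __m512d yv = _mm512_maskz_loadu_pd(mask, y);  // masked loads: nothing past the end of the arrays is read
  __m512d xv = _mm512_maskz_loadu_pd(mask, x);
  yv = _mm512_fmadd_pd(av, xv, yv);
  _mm512_mask_storeu_pd(y, mask, yv);
}

This can save some of the index setup before the loop at the cost of an extra pointer increment (plus the cmp against the end pointer) inside it.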

answered by chtz