I was inspired by this link https://www.sigarch.org/simd-instructions-considered-harmful/ to look into how AVX512 performs. My idea was that the cleanup loop after the main loop could be removed using the AVX512 mask operations.
Here is the code I am using:

#include <immintrin.h>

void daxpy2(int n, double a, const double x[], double y[]) {
    __m512d av = _mm512_set1_pd(a);
    int r = n&7, n2 = n - r;
    for(int i=-n2; i<0; i+=8) {
        __m512d yv = _mm512_loadu_pd(&y[i+n2]);
        __m512d xv = _mm512_loadu_pd(&x[i+n2]);
        yv = _mm512_fmadd_pd(av, xv, yv);
        _mm512_storeu_pd(&y[i+n2], yv);
    }
    __m512d yv = _mm512_loadu_pd(&y[n2]);
    __m512d xv = _mm512_loadu_pd(&x[n2]);
    yv = _mm512_fmadd_pd(av, xv, yv);
    __mmask8 mask = (1 << r) - 1;
    //__mmask8 mask = _bextr_u32(-1, 0, r);
    _mm512_mask_storeu_pd(&y[n2], mask, yv);
}
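Here is a minimal harness (my addition, not part of the original question) to sanity-check daxpy2 for an n that is not a multiple of 8. It assumes daxpy2 above is in the same file and that AVX-512 is enabled (e.g. -mavx512f or -xCOMMON-AVX512). The arrays are padded to a multiple of 8 because the tail above still does full-width loads, an issue the answers below address:

#include <stdio.h>

int main(void) {
    // n = 13 exercises both the main loop (8 elements) and the tail (5).
    // Arrays are padded to 16 doubles because the tail loads a full vector.
    double x[16] = {0}, y[16] = {0}, yref[16] = {0};
    double a = 2.5;
    int n = 13;
    for (int i = 0; i < n; ++i) { x[i] = i; y[i] = yref[i] = 100.0 - i; }
    for (int i = 0; i < n; ++i) yref[i] += a * x[i];   // scalar reference
    daxpy2(n, a, x, y);
    for (int i = 0; i < n; ++i)
        if (y[i] != yref[i]) { printf("mismatch at %d\n", i); return 1; }
    puts("ok");
    return 0;
}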
I thought using BMI1 and/or BMI2 instructions could generate masks with fewer instructions. However,

__mmask8 mask = _bextr_u32(-1, 0, r);

is no better (in number of instructions) than

__mmask8 mask = (1 << r) - 1;

see https://godbolt.org/z/BFQCM3 and https://godbolt.org/z/tesmB_. This appears to be because bextr takes its length field in bits [15:8] of its control operand, so the compiler has to shift r left by 8 anyway. Can the mask be generated with fewer instructions (e.g. with BMI or another method) or more optimally?
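As a quick side check (my addition, assuming r in 0..7 as above, compiled with BMI1 enabled, e.g. -mbmi), the two expressions really do compute the same tail mask:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    for (unsigned r = 0; r < 8; ++r) {
        unsigned m1 = (1u << r) - 1;          // shift-and-decrement mask
        unsigned m2 = _bextr_u32(-1u, 0, r);  // extract r bits starting at bit 0
        printf("r=%u  m1=0x%02x  m2=0x%02x\n", r, m1, m2);
    }
    return 0;
}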
I have augmented the table in the link with my results for AVX512.
ISA                           | MIPS-32 | AVX2   | RV32V | AVX512 |
------------------------------|---------|--------|-------|--------|
Instructions (static)         |      22 |     29 |    13 |     28 |
Instructions per Main Loop    |       7 |     6* |    10 |     5* |
Bookkeeping Instructions      |      15 |     23 |     3 |     23 |
Results per Main Loop         |       2 |      4 |    64 |      8 |
Instructions (dynamic n=1000) |    3511 | 1517** |   163 |    645 |
*macro-op fusion will reduce the number of uops in the main loop by 1
** without the unnecessary cmp instructions it would only be 1250+ instructions.
I think if the authors of the link had counted from -n up to 0 instead of from 0 to n, they could have skipped the cmp instruction in the main loop as I have (see the assembly below), so for AVX2 it could have been 5 instructions in the main loop.
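To illustrate the idea in scalar form (a schematic sketch of my own, not from the article; compilers may of course transform these loops): counting upward from a negative index lets the add set the sign flag that the loop branch tests, so no separate cmp is needed.

// Usual form: the branch needs an explicit comparison against n2.
void daxpy_v1(int n2, double a, const double x[], double y[]) {
    for (int i = 0; i < n2; ++i)       // add ; cmp i,n2 ; jl
        y[i] += a * x[i];
}

// Negative-index form: add sets SF, which js (or jl) tests directly.
void daxpy_v2(int n2, double a, const double x[], double y[]) {
    for (int i = -n2; i < 0; ++i)      // add ; js  (no cmp)
        y[i + n2] += a * x[i + n2];
}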
Here is the assembly with ICC19 and -O3 -xCOMMON-AVX512
daxpy2(int, double, double const*, double*):
mov eax, edi #6.13
and eax, 7 #6.13
movsxd r9, edi #6.25
sub r9, rax #6.21
mov ecx, r9d #7.14
neg ecx #7.14
movsxd rcx, ecx #7.14
vbroadcastsd zmm16, xmm0 #5.16
lea rdi, QWORD PTR [rsi+r9*8] #9.35
lea r8, QWORD PTR [rdx+r9*8] #8.35
test rcx, rcx #7.20
jge ..B1.5 # Prob 36% #7.20
..B1.3: # Preds ..B1.1 ..B1.3
vmovups zmm17, ZMMWORD PTR [rdi+rcx*8] #10.10
vfmadd213pd zmm17, zmm16, ZMMWORD PTR [r8+rcx*8] #10.10
vmovups ZMMWORD PTR [r8+rcx*8], zmm17 #11.23
add rcx, 8 #7.23
js ..B1.3 # Prob 82% #7.20
..B1.5: # Preds ..B1.3 ..B1.1
vmovups zmm17, ZMMWORD PTR [rsi+r9*8] #15.8
vfmadd213pd zmm16, zmm17, ZMMWORD PTR [rdx+r9*8] #15.8
mov edx, -1 #17.19
shl eax, 8 #17.19
bextr eax, edx, eax #17.19
kmovw k1, eax #18.3
vmovupd ZMMWORD PTR [r8]{k1}, zmm16 #18.3
vzeroupper #19.1
ret #19.1
where

add rcx, 8
js ..B1.3

should macro-op fuse to one instruction. However, as pointed out by Peter Cordes in this answer, js cannot fuse. The compiler could have generated jl instead, which would have fused.
I used Agner Fog's testp utility to get the core clocks (not reference clocks), instructions, and uops retired. I did this for SSE2 (actually AVX2 with FMA but with 128-bit vectors), AVX2, and AVX512, for three different loop variants:
v1 = for(int64_t i=0; i<n; i+=vec_size) // generates cmp instruction
v2 = for(int64_t i=-n2; i<0; i+=vec_size) // no cmp but uses js
v3 = for(int64_t i=-n2; i!=0; i+=vec_size) // no cmp and uses jne
vec_size = 2 for SSE, 4 for AVX2, and 8 for AVX512
vec_size | version | core cycles | instructions | uops
---------|---------|-------------|--------------|-----
       2 | v1      |         895 |         3014 | 3524
       2 | v2      |         900 |         2518 | 3535
       2 | v3      |         870 |         2518 | 3035
       4 | v1      |         527 |         1513 | 1777
       4 | v2      |         520 |         1270 | 1777
       4 | v3      |         517 |         1270 | 1541
       8 | v1      |         285 |          765 |  910
       8 | v2      |         285 |          645 |  910
       8 | v3      |         285 |          645 |  790
Notice that core clocks are not really a function of the loop version; they depend only on the number of loop iterations and are proportional to 2*n/vec_size:

SSE    2*1000/2 = 1000
AVX2   2*1000/4 = 500
AVX512 2*1000/8 = 250
The number of instructions does change from v1 to v2 but not between v2 and v3. For v1 it is proportional to 6*n/vec_size, and for v2 and v3 to 5*n/vec_size. Finally, the number of uops is more or less the same for v1 and v2 but drops for v3. For v1 and v2 it is proportional to 7*n/vec_size, and for v3 to 6*n/vec_size.
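As a quick check against the table, take vec_size=2 and n=1000: 6*1000/2 = 3000 ≈ 3014 instructions for v1, 5*1000/2 = 2500 ≈ 2518 for v2 and v3, 7*1000/2 = 3500 ≈ 3524 and 3535 uops for v1 and v2, and 6*1000/2 = 3000 ≈ 3035 uops for v3; the small surplus is the bookkeeping outside the loop.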
Here is the result with IACA3 for vec_size=2
Throughput Analysis Report
--------------------------
Block Throughput: 1.49 Cycles Throughput Bottleneck: FrontEnd
Loop Count: 50
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
--------------------------------------------------------------------------------------------------
| Cycles | 0.5 0.0 | 0.5 | 1.5 1.0 | 1.5 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
--------------------------------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovupd xmm1, xmmword ptr [r8+rax*8]
| 2 | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vfmadd213pd xmm1, xmm2, xmmword ptr [rcx+rax*8]
| 2 | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rcx+rax*8], xmm1
| 1* | | | | | | | | | add rax, 0x2
| 0*F | | | | | | | | | js 0xffffffffffffffe3
Total Num Of Uops: 6
IACA claims that js macro-fuses with add, which disagrees with Agner and the performance counters from the testp utility. See above: v2 is proportional to 7*n/vec_size and v3 to 6*n/vec_size, which I infer to mean that js does not macro-fuse.
I think the authors of the link, in addition to the number of instructions, should also have considered core cycles and maybe uops.
You can save one instruction if you use the following BMI2 intrinsic:

__mmask8 mask = _bzhi_u32(-1, r);

instead of:

__mmask8 mask = (1 << r) - 1;

See Godbolt link.
The bzhi instruction zeros the high bits starting at a specified bit position. With register operands, bzhi has a latency of 1 cycle and a throughput of 2 per cycle.
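As a worked example (my addition): with r = 5 the tail mask is _bzhi_u32(-1, 5) = 0x1F, the same value (1 << 5) - 1 produces. The win is that bzhi consumes -1 and r directly, so the compiler only needs a mov of -1 plus the bzhi itself, rather than the three-instruction mov/shl/bextr sequence visible in the assembly above.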
Additionally to @wim's answer of using _bzhi_u32 instead of _bextr_u32, you should:

- Mask the _mm512_loadu_pd instructions at the end, to avoid loading invalid memory (https://stackoverflow.com/a/54530225) or doing arithmetic on non-finite values.
- Use size_t (as in the code below) for the size and index variables, to avoid the movsxd sign-extension. This is generally good advice on 64-bit systems, unless you need to store a lot of index variables.
- Use i!=0 instead of i<0 as the loop condition to get a jne instead of js, since this pairs better with the add instruction: https://stackoverflow.com/a/31778403
- Instead of n2=n-r, you could also calculate n2 = n & (-8) or n2 = n ^ r (all three agree, since r = n & 7 is exactly the low three bits of n). Not sure if that makes a relevant difference (icc does not seem to know or to care about that). Godbolt-Link
#include <immintrin.h>

void daxpy2(size_t n, double a, const double x[], double y[]) {
    __m512d av = _mm512_set1_pd(a);
    size_t r = n&7, n2 = n & (-8);
    for(size_t i=-n2; i!=0; i+=8) {
        __m512d yv = _mm512_loadu_pd(&y[i+n2]);
        __m512d xv = _mm512_loadu_pd(&x[i+n2]);
        yv = _mm512_fmadd_pd(av, xv, yv);
        _mm512_storeu_pd(&y[i+n2], yv);
    }
    __mmask8 mask = _bzhi_u32(-1, r);
    __m512d yv = _mm512_mask_loadu_pd(_mm512_undefined_pd(), mask, &y[n2]);
    __m512d xv = _mm512_mask_loadu_pd(_mm512_undefined_pd(), mask, &x[n2]);
    yv = _mm512_mask_fmadd_pd(av, mask, xv, yv);
    _mm512_mask_storeu_pd(&y[n2], mask, yv);
}
To further reduce the number of instructions, you can use pointer incrementation, e.g., like the sketch below (this increases the number of instructions inside the loop, however).
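A minimal sketch of that idea (my reconstruction, as the original example is not preserved here; daxpy2_ptr is a hypothetical name): the two pointers themselves walk through the arrays, so the loop needs no separate index register, at the cost of two adds per iteration.

#include <immintrin.h>

void daxpy2_ptr(size_t n, double a, const double x[], double y[]) {
    __m512d av = _mm512_set1_pd(a);
    size_t r = n & 7;
    const double *xend = x + (n & -8);    // one past the last full vector
    for (; x != xend; x += 8, y += 8) {   // two adds, one jne per iteration
        __m512d yv = _mm512_loadu_pd(y);
        __m512d xv = _mm512_loadu_pd(x);
        yv = _mm512_fmadd_pd(av, xv, yv);
        _mm512_storeu_pd(y, yv);
    }
    // Masked tail, exactly as in the code above; x and y now point at it.
    __mmask8 mask = _bzhi_u32(-1, r);
    __m512d yv = _mm512_mask_loadu_pd(_mm512_undefined_pd(), mask, y);
    __m512d xv = _mm512_mask_loadu_pd(_mm512_undefined_pd(), mask, x);
    yv = _mm512_mask_fmadd_pd(av, mask, xv, yv);
    _mm512_mask_storeu_pd(y, mask, yv);
}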