I was inspired by this link https://www.sigarch.org/simd-instructions-considered-harmful/ to look into how AVX512 performs. My idea was that the cleanup loop after the main loop could be removed using the AVX512 mask operations.
Here is the code I am using:

#include <immintrin.h>

void daxpy2(int n, double a, const double x[], double y[]) {
    __m512d av = _mm512_set1_pd(a);
    int r = n&7, n2 = n - r;
    for(int i=-n2; i<0; i+=8) {
        __m512d yv = _mm512_loadu_pd(&y[i+n2]);
        __m512d xv = _mm512_loadu_pd(&x[i+n2]);
        yv = _mm512_fmadd_pd(av, xv, yv);
        _mm512_storeu_pd(&y[i+n2], yv);
    }
    __m512d yv = _mm512_loadu_pd(&y[n2]);
    __m512d xv = _mm512_loadu_pd(&x[n2]);
    yv = _mm512_fmadd_pd(av, xv, yv);
    __mmask8 mask = (1 << r) - 1;
    //__mmask8 mask = _bextr_u32(-1, 0, r);
    _mm512_mask_storeu_pd(&y[n2], mask, yv);
}
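Here is a minimal harness (my addition, not part of the original question) to sanity-check daxpy2 for an n that is not a multiple of 8. It assumes daxpy2 above is in the same file and that AVX-512 is enabled (e.g. -mavx512f or -xCOMMON-AVX512). The arrays are padded to a multiple of 8 because the tail above still does full-width loads, an issue the answers below address:

#include <stdio.h>

int main(void) {
    // n = 13 exercises both the main loop (8 elements) and the tail (5).
    // Arrays are padded to 16 doubles because the tail loads a full vector.
    double x[16] = {0}, y[16] = {0}, yref[16] = {0};
    double a = 2.5;
    int n = 13;
    for (int i = 0; i < n; ++i) { x[i] = i; y[i] = yref[i] = 100.0 - i; }
    for (int i = 0; i < n; ++i) yref[i] += a * x[i];   // scalar reference
    daxpy2(n, a, x, y);
    for (int i = 0; i < n; ++i)
        if (y[i] != yref[i]) { printf("mismatch at %d\n", i); return 1; }
    puts("ok");
    return 0;
}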
I thought using BMI1 and/or BMI2 instructions could generate masks with fewer instructions. However,

__mmask8 mask = _bextr_u32(-1, 0, r);

is no better (in number of instructions) than

__mmask8 mask = (1 << r) - 1;

see https://godbolt.org/z/BFQCM3 and https://godbolt.org/z/tesmB_. This appears to be because bextr takes its length field in bits [15:8] of its control operand, so the compiler has to shift r left by 8 anyway. Can the mask be generated with fewer instructions (e.g. with BMI or another method) or more optimally?
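As a quick side check (my addition, assuming r in 0..7 as above, compiled with BMI1 enabled, e.g. -mbmi), the two expressions really do compute the same tail mask:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    for (unsigned r = 0; r < 8; ++r) {
        unsigned m1 = (1u << r) - 1;          // shift-and-decrement mask
        unsigned m2 = _bextr_u32(-1u, 0, r);  // extract r bits starting at bit 0
        printf("r=%u  m1=0x%02x  m2=0x%02x\n", r, m1, m2);
    }
    return 0;
}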
I have augmented the table in the link with my results for AVX512.
ISA                           | MIPS-32 | AVX2   | RV32V | AVX512 |
------------------------------|---------|--------|-------|--------|
Instructions (static)         |      22 |     29 |    13 |     28 |
Instructions per Main Loop    |       7 |     6* |    10 |     5* |
Bookkeeping Instructions      |      15 |     23 |     3 |     23 |
Results per Main Loop         |       2 |      4 |    64 |      8 |
Instructions (dynamic n=1000) |    3511 | 1517** |   163 |    645 |
*macro-op fusion will reduce the number of uops in the main loop by 1
** without the unnecessary cmp instructions it would only be 1250+ instructions.
I think if the authors of the link had counted from -n up to 0 instead of from 0 to n, they could have skipped the cmp instruction in the main loop as I have (see the assembly below), so for AVX2 it could have been 5 instructions in the main loop.
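To illustrate the idea in scalar form (a schematic sketch of my own, not from the article; compilers may of course transform these loops): counting upward from a negative index lets the add set the sign flag that the loop branch tests, so no separate cmp is needed.

// Usual form: the branch needs an explicit comparison against n2.
void daxpy_v1(int n2, double a, const double x[], double y[]) {
    for (int i = 0; i < n2; ++i)       // add ; cmp i,n2 ; jl
        y[i] += a * x[i];
}

// Negative-index form: add sets SF, which js (or jl) tests directly.
void daxpy_v2(int n2, double a, const double x[], double y[]) {
    for (int i = -n2; i < 0; ++i)      // add ; js  (no cmp)
        y[i + n2] += a * x[i + n2];
}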
Here is the assembly with ICC19 and -O3 -xCOMMON-AVX512
daxpy2(int, double, double const*, double*):
mov eax, edi #6.13
and eax, 7 #6.13
movsxd r9, edi #6.25
sub r9, rax #6.21
mov ecx, r9d #7.14
neg ecx #7.14
movsxd rcx, ecx #7.14
vbroadcastsd zmm16, xmm0 #5.16
lea rdi, QWORD PTR [rsi+r9*8] #9.35
lea r8, QWORD PTR [rdx+r9*8] #8.35
test rcx, rcx #7.20
jge ..B1.5 # Prob 36% #7.20
..B1.3: # Preds ..B1.1 ..B1.3
vmovups zmm17, ZMMWORD PTR [rdi+rcx*8] #10.10
vfmadd213pd zmm17, zmm16, ZMMWORD PTR [r8+rcx*8] #10.10
vmovups ZMMWORD PTR [r8+rcx*8], zmm17 #11.23
add rcx, 8 #7.23
js ..B1.3 # Prob 82% #7.20
..B1.5: # Preds ..B1.3 ..B1.1
vmovups zmm17, ZMMWORD PTR [rsi+r9*8] #15.8
vfmadd213pd zmm16, zmm17, ZMMWORD PTR [rdx+r9*8] #15.8
mov edx, -1 #17.19
shl eax, 8 #17.19
bextr eax, edx, eax #17.19
kmovw k1, eax #18.3
vmovupd ZMMWORD PTR [r8]{k1}, zmm16 #18.3
vzeroupper #19.1
ret #19.1
where

add rcx, 8
js ..B1.3

should macro-op fuse to one instruction. However, as pointed out by Peter Cordes in this answer, js cannot fuse. The compiler could have generated jl instead, which would have fused.
I used Agner Fog's testp utility to get the core clocks (not reference clocks), instructions, and uops retired. I did this for SSE2 (actually AVX2 with FMA but with 128-bit vectors), AVX2, and AVX512, for three different loop variants:
v1 = for(int64_t i=0; i<n; i+=vec_size) // generates cmp instruction
v2 = for(int64_t i=-n2; i<0; i+=vec_size) // no cmp but uses js
v3 = for(int64_t i=-n2; i!=0; i+=vec_size) // no cmp and uses jne
vec_size = 2 for SSE, 4 for AVX2, and 8 for AVX512
vec_size | version | core cycles | instructions | uops
---------|---------|-------------|--------------|-----
       2 | v1      |         895 |         3014 | 3524
       2 | v2      |         900 |         2518 | 3535
       2 | v3      |         870 |         2518 | 3035
       4 | v1      |         527 |         1513 | 1777
       4 | v2      |         520 |         1270 | 1777
       4 | v3      |         517 |         1270 | 1541
       8 | v1      |         285 |          765 |  910
       8 | v2      |         285 |          645 |  910
       8 | v3      |         285 |          645 |  790
Notice that core clocks are not really a function of the loop version; they depend only on the number of loop iterations and are proportional to 2*n/vec_size:

SSE    2*1000/2 = 1000
AVX2   2*1000/4 = 500
AVX512 2*1000/8 = 250
The number of instructions does change from v1 to v2 but not between v2 and v3. For v1 it is proportional to 6*n/vec_size, and for v2 and v3 to 5*n/vec_size. Finally, the number of uops is more or less the same for v1 and v2 but drops for v3. For v1 and v2 it is proportional to 7*n/vec_size, and for v3 to 6*n/vec_size.
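As a quick check against the table, take vec_size=2 and n=1000: 6*1000/2 = 3000 ≈ 3014 instructions for v1, 5*1000/2 = 2500 ≈ 2518 for v2 and v3, 7*1000/2 = 3500 ≈ 3524 and 3535 uops for v1 and v2, and 6*1000/2 = 3000 ≈ 3035 uops for v3; the small surplus is the bookkeeping outside the loop.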
Here is the result with IACA3 for vec_size=2
Throughput Analysis Report
--------------------------
Block Throughput: 1.49 Cycles Throughput Bottleneck: FrontEnd
Loop Count: 50
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
--------------------------------------------------------------------------------------------------
| Cycles | 0.5 0.0 | 0.5 | 1.5 1.0 | 1.5 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
--------------------------------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovupd xmm1, xmmword ptr [r8+rax*8]
| 2 | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vfmadd213pd xmm1, xmm2, xmmword ptr [rcx+rax*8]
| 2 | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rcx+rax*8], xmm1
| 1* | | | | | | | | | add rax, 0x2
| 0*F | | | | | | | | | js 0xffffffffffffffe3
Total Num Of Uops: 6
IACA claims that js macro-fuses with add, which disagrees with Agner and the performance counters from the testp utility. See above: v2 is proportional to 7*n/vec_size and v3 to 6*n/vec_size, which I infer to mean that js does not macro-fuse.
I think the authors of the link, in addition to the number of instructions, should also have considered core cycles and maybe uops.
You can save one instruction if you use the following BMI2 intrinsic:

__mmask8 mask = _bzhi_u32(-1, r);

instead of:

__mmask8 mask = (1 << r) - 1;

See Godbolt link.
The bzhi instruction zeros the high bits starting at a specified bit position. With register operands, bzhi has a latency of 1 cycle and a throughput of 2 per cycle.
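As a worked example (my addition): with r = 5 the tail mask is _bzhi_u32(-1, 5) = 0x1F, the same value (1 << 5) - 1 produces. The win is that bzhi consumes -1 and r directly, so the compiler only needs a mov of -1 plus the bzhi itself, rather than the three-instruction mov/shl/bextr sequence visible in the assembly above.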
Additionally to @wim's answer of using _bzhi_u32 instead of _bextr_u32, you should:

- Mask the _mm512_loadu_pd instructions at the end, to avoid loading invalid memory (https://stackoverflow.com/a/54530225) or doing arithmetic on non-finite values.
- Use size_t (as in the code below) for the size and index variables, to avoid the movsxd sign-extension. This is generally good advice on 64-bit systems, unless you need to store a lot of index variables.
- Use i!=0 instead of i<0 as the loop condition to get a jne instead of js, since this pairs better with the add instruction: https://stackoverflow.com/a/31778403
- Instead of n2=n-r, you could also calculate n2 = n & (-8) or n2 = n ^ r (all three agree, since r = n & 7 is exactly the low three bits of n). Not sure if that makes a relevant difference (icc does not seem to know or to care about that). Godbolt-Link
#include <immintrin.h>

void daxpy2(size_t n, double a, const double x[], double y[]) {
    __m512d av = _mm512_set1_pd(a);
    size_t r = n&7, n2 = n & (-8);
    for(size_t i=-n2; i!=0; i+=8) {
        __m512d yv = _mm512_loadu_pd(&y[i+n2]);
        __m512d xv = _mm512_loadu_pd(&x[i+n2]);
        yv = _mm512_fmadd_pd(av, xv, yv);
        _mm512_storeu_pd(&y[i+n2], yv);
    }
    __mmask8 mask = _bzhi_u32(-1, r);
    __m512d yv = _mm512_mask_loadu_pd(_mm512_undefined_pd(), mask, &y[n2]);
    __m512d xv = _mm512_mask_loadu_pd(_mm512_undefined_pd(), mask, &x[n2]);
    yv = _mm512_mask_fmadd_pd(av, mask, xv, yv);
    _mm512_mask_storeu_pd(&y[n2], mask, yv);
}
To further reduce the number of instructions, you can use pointer incrementation, e.g., like the sketch below (this increases the number of instructions inside the loop, however).
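A minimal sketch of that idea (my reconstruction, as the original example is not preserved here; daxpy2_ptr is a hypothetical name): the two pointers themselves walk through the arrays, so the loop needs no separate index register, at the cost of two adds per iteration.

#include <immintrin.h>

void daxpy2_ptr(size_t n, double a, const double x[], double y[]) {
    __m512d av = _mm512_set1_pd(a);
    size_t r = n & 7;
    const double *xend = x + (n & -8);    // one past the last full vector
    for (; x != xend; x += 8, y += 8) {   // two adds, one jne per iteration
        __m512d yv = _mm512_loadu_pd(y);
        __m512d xv = _mm512_loadu_pd(x);
        yv = _mm512_fmadd_pd(av, xv, yv);
        _mm512_storeu_pd(y, yv);
    }
    // Masked tail, exactly as in the code above; x and y now point at it.
    __mmask8 mask = _bzhi_u32(-1, r);
    __m512d yv = _mm512_mask_loadu_pd(_mm512_undefined_pd(), mask, y);
    __m512d xv = _mm512_mask_loadu_pd(_mm512_undefined_pd(), mask, x);
    yv = _mm512_mask_fmadd_pd(av, mask, xv, yv);
    _mm512_mask_storeu_pd(y, mask, yv);
}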