I've created two versions of a dot product in .NET using AVX-256 instructions. One uses a fused multiply-add, and the other a separate multiply and add.
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Both versions assume n is a multiple of 4 and return a vector of four
// partial sums; the final horizontal reduction is left to the caller.
public static unsafe Vector256<double> Dot(double* x, double* y, int n)
{
    var vresult = Vector256<double>.Zero;
    for (int i = 0; i < n; i += 4)
        vresult = Avx.Add(Avx.Multiply(Avx.LoadVector256(x + i), Avx.LoadVector256(y + i)), vresult);
    return vresult;
}

public static unsafe Vector256<double> Dot2(double* x, double* y, int n)
{
    var vresult = Vector256<double>.Zero;
    for (int i = 0; i < n; i += 4)
        vresult = Fma.MultiplyAdd(Avx.LoadVector256(x + i), Avx.LoadVector256(y + i), vresult);
    return vresult;
}
This compiles to the following JIT asm:
C.Dot(Double*, Double*, Int32)
L0000: vzeroupper
L0003: vxorps ymm0, ymm0, ymm0
L0008: xor eax, eax
L000a: test r9d, r9d
L000d: jle short L002b
L000f: nop
L0010: movsxd r10, eax
L0013: vmovupd ymm1, [rdx+r10*8]
L0019: vmulpd ymm1, ymm1, [r8+r10*8]
L001f: vaddpd ymm0, ymm1, ymm0
L0023: add eax, 4
L0026: cmp eax, r9d
L0029: jl short L0010
L002b: vmovupd [rcx], ymm0
L002f: mov rax, rcx
L0032: vzeroupper
L0035: ret
C.Dot2(Double*, Double*, Int32)
L0000: vzeroupper
L0003: vxorps ymm0, ymm0, ymm0
L0008: xor eax, eax
L000a: test r9d, r9d
L000d: jle short L002b
L000f: nop
L0010: movsxd r10, eax
L0013: vmovupd ymm1, [rdx+r10*8]
L0019: vfmadd132pd ymm1, ymm0, [r8+r10*8]
L001f: vmovaps ymm0, ymm1
L0023: add eax, 4
L0026: cmp eax, r9d
L0029: jl short L0010
L002b: vmovupd [rcx], ymm0
L002f: mov rax, rcx
L0032: vzeroupper
L0035: ret
When I benchmark this code with BenchmarkDotNet on my Intel processor, I see a modest speedup, as expected. But when I run it on my AMD Ryzen 5900X, it's about 30% slower at nearly every array size. Is this a bug in AMD's implementation of vfmadd132pd, or in Microsoft's compiler?
Near duplicate of this Q&A about Unrolling dot-product loops with multiple accumulators: you bottleneck on vaddpd or vfma...pd latency, not throughput, and yes, Zen3 has lower-latency FP vaddpd (3 cycles) than FMA (4 cycles). See https://uops.info/.
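To connect those numbers (a rough back-of-the-envelope, assuming the loop is bound purely by the accumulator's latency chain): with a single accumulator the add version can start one iteration every 3 cycles and the FMA version only one every 4, a 4/3 ≈ 1.33x ratio, which lines up with the ~30% slowdown measured on the 5900X.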
Intel Skylake / Ice Lake CPUs have 4-cycle latency for all FP add/sub/mul/fma operations, running them on the same execution units, so I wouldn't have expected a speedup from FMA there; the front-end shouldn't be a bottleneck. Maybe in the add version an independent multiply occasionally delays an add by a cycle, hurting the critical-path dependency chain? Unlikely: oldest-ready-first uop scheduling should mean the independent work (the multiplies) stays well ahead of the adds.
Intel Haswell, Broadwell, and Alder Lake have vaddpd latency of 3 cycles, less than their FMA latency, so you'd see the same benefit from the separate-add version there.
But if you unroll with multiple accumulators, you can hide FP latency and bottleneck on throughput instead, or on load throughput, since you need two loads per FMA (sketched below).
AMD does have the FP throughput to run 2 multiplies and 2 adds per clock on Zen2 and later, but Intel doesn't. Although with the load bottleneck you'd only get 1 each per clock anyway.
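As a minimal sketch of that unrolling (my own illustration, not from the question: the Dot4 name and the choice of four accumulators are assumptions, and it still assumes n is a multiple of the unroll width and leaves the final horizontal reduction to the caller):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Sketch: four independent accumulators so each FMA waits only on its own
// chain, hiding the 3-4 cycle FP latency. Assumes n is a multiple of 16.
public static unsafe Vector256<double> Dot4(double* x, double* y, int n)
{
    var acc0 = Vector256<double>.Zero;
    var acc1 = Vector256<double>.Zero;
    var acc2 = Vector256<double>.Zero;
    var acc3 = Vector256<double>.Zero;
    for (int i = 0; i < n; i += 16)
    {
        acc0 = Fma.MultiplyAdd(Avx.LoadVector256(x + i),      Avx.LoadVector256(y + i),      acc0);
        acc1 = Fma.MultiplyAdd(Avx.LoadVector256(x + i + 4),  Avx.LoadVector256(y + i + 4),  acc1);
        acc2 = Fma.MultiplyAdd(Avx.LoadVector256(x + i + 8),  Avx.LoadVector256(y + i + 8),  acc2);
        acc3 = Fma.MultiplyAdd(Avx.LoadVector256(x + i + 12), Avx.LoadVector256(y + i + 12), acc3);
    }
    // Combine the four partial-sum vectors; per-lane reduction is still up to
    // the caller, matching the original Dot/Dot2 signatures.
    return Avx.Add(Avx.Add(acc0, acc1), Avx.Add(acc2, acc3));
}

With two loads per FMA and (typically) two loads per clock, the throughput bound is roughly one FMA per clock, so around four accumulators should be enough to cover a 4-cycle FMA latency; more buys little here.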
See also Latency bounds and throughput bounds for processors for operations that must occur in sequence regarding dependency chains and latency, and https://agner.org/optimize/ (microarchitecture and asm guides). Also https://uops.info/ for better instruction tables than Agner's.
Your current version takes advantage of the data parallelism in a dot product using SIMD: more work per instruction. But you're not doing anything to let the CPU find any instruction-level parallelism between those SIMD vector operations, so you're missing one of the three factors that can scale performance (SIMD parallelism, ILP, and thread-level parallelism for huge arrays).