
Is vfmadd132pd slow on AMD Zen 3 architecture?

I've created two versions of a dot product in .NET using AVX-256 instructions. One uses a fused multiply-add; the other separates it out into a multiply and an add.

public static unsafe Vector256<double> Dot(double* x, double* y, int n)
{
    var vresult = Vector256<double>.Zero;
    int i = 0;

    for (; i < n; i += 4)
        vresult = Avx.Add(Avx.Multiply(Avx.LoadVector256(x + i), Avx.LoadVector256(y + i)), vresult);

    return vresult;
}

public static unsafe Vector256<double> Dot2(double* x, double* y, int n)
{
    var vresult = Vector256<double>.Zero;
    int i = 0;

    for (; i < n; i += 4)
        vresult = Fma.MultiplyAdd(Avx.LoadVector256(x + i), Avx.LoadVector256(y + i), vresult);

    return vresult;
}

This compiles to the following JIT assembly:

C.Dot(Double*, Double*, Int32)
    L0000: vzeroupper
    L0003: vxorps ymm0, ymm0, ymm0
    L0008: xor eax, eax
    L000a: test r9d, r9d
    L000d: jle short L002b
    L000f: nop
    L0010: movsxd r10, eax
    L0013: vmovupd ymm1, [rdx+r10*8]
    L0019: vmulpd ymm1, ymm1, [r8+r10*8]
    L001f: vaddpd ymm0, ymm1, ymm0
    L0023: add eax, 4
    L0026: cmp eax, r9d
    L0029: jl short L0010
    L002b: vmovupd [rcx], ymm0
    L002f: mov rax, rcx
    L0032: vzeroupper
    L0035: ret

C.Dot2(Double*, Double*, Int32)
    L0000: vzeroupper
    L0003: vxorps ymm0, ymm0, ymm0
    L0008: xor eax, eax
    L000a: test r9d, r9d
    L000d: jle short L002b
    L000f: nop
    L0010: movsxd r10, eax
    L0013: vmovupd ymm1, [rdx+r10*8]
    L0019: vfmadd132pd ymm1, ymm0, [r8+r10*8]
    L001f: vmovaps ymm0, ymm1
    L0023: add eax, 4
    L0026: cmp eax, r9d
    L0029: jl short L0010
    L002b: vmovupd [rcx], ymm0
    L002f: mov rax, rcx
    L0032: vzeroupper
    L0035: ret

When I benchmark this code using my Intel processor and benchmark.net, I see a modest speedup as expected. But when I run it on my AMD Ryzen 5900X, it's about 30% slower on nearly every size of array. Is this a bug in AMD's implementation of vfmadd132pd or in Microsoft's compiler?

asked Sep 07 '25 08:09 by kolbe

1 Answer

Near duplicate of this Q&A about unrolling dot-product loops with multiple accumulators: you bottleneck on vaddpd or vfma...pd latency, not throughput, and yes, Zen 3 has lower-latency FP vaddpd (3 cycles) than FMA (4 cycles). See https://uops.info/.
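Those two latency numbers are enough to predict the regression. A back-of-envelope sketch (cycle counts are the ones quoted above; treat them as assumptions for your exact core):

```csharp
using System;

const double fmaLatency = 4.0;  // Zen 3 vfmadd...pd latency in cycles
const double addLatency = 3.0;  // Zen 3 vaddpd latency in cycles

// With a single accumulator, each loop iteration waits for the previous
// add/FMA to finish, so iterations per cycle is 1/latency.
// The FMA version's slowdown relative to mul+add is therefore:
double slowdown = fmaLatency / addLatency;
Console.WriteLine($"{slowdown:F2}x");  // ≈ 1.33x
```

That 4/3 ratio lines up with the ~30% slowdown you measured, which is why this looks like a latency bound rather than a hardware or compiler bug.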

Intel Skylake / Ice Lake CPUs have 4-cycle latency for all FP add/sub/mul/fma operations, running them on the same execution units. I wouldn't have expected a speedup from FMA there, since the front-end shouldn't be a bottleneck. Maybe in the multiply+add version, an independent multiply sometimes delays an add by a cycle, hurting the critical-path dependency chain? Unlikely; oldest-ready-first uop scheduling should mean the independent work (the multiplies) is way ahead of the adds.

Intel Haswell, Broadwell, and Alder Lake have vaddpd latency of 3 cycles, less than their FMA latencies, so you'd see a benefit there.


But if you unroll with multiple accumulators, you can hide FP latency and bottleneck on throughput. Or on load throughput, since you need 2 loads per FMA.
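A sketch of that unrolling, using four independent accumulators so four FMA latency chains run in flight at once. `Dot4` is a hypothetical name, and for brevity the loop assumes n is a multiple of 16 (real code would add a cleanup loop for the tail, and should check Fma.IsSupported):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static unsafe class DotUnrolled
{
    public static Vector256<double> Dot4(double* x, double* y, int n)
    {
        // Four independent dependency chains: the CPU can start the next
        // FMA for each chain without waiting on the other three.
        var acc0 = Vector256<double>.Zero;
        var acc1 = Vector256<double>.Zero;
        var acc2 = Vector256<double>.Zero;
        var acc3 = Vector256<double>.Zero;

        for (int i = 0; i < n; i += 16)
        {
            acc0 = Fma.MultiplyAdd(Avx.LoadVector256(x + i),      Avx.LoadVector256(y + i),      acc0);
            acc1 = Fma.MultiplyAdd(Avx.LoadVector256(x + i + 4),  Avx.LoadVector256(y + i + 4),  acc1);
            acc2 = Fma.MultiplyAdd(Avx.LoadVector256(x + i + 8),  Avx.LoadVector256(y + i + 8),  acc2);
            acc3 = Fma.MultiplyAdd(Avx.LoadVector256(x + i + 12), Avx.LoadVector256(y + i + 12), acc3);
        }

        // Combine the accumulators once, outside the loop.
        return Avx.Add(Avx.Add(acc0, acc1), Avx.Add(acc2, acc3));
    }
}
```

With a 4-cycle FMA latency and up to 2 FMAs per clock, you'd want at least 4×2 = 8 chains to fully hide latency in theory; four is already enough to move the bottleneck to loads on most cores.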

AMD does have the FP throughput to run 2 multiplies and 2 adds per clock on Zen2 and later, but Intel doesn't. Although with the load bottleneck you'd only get 1 each per clock anyway.
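The load bottleneck falls out of the same kind of arithmetic. These per-clock limits are assumptions for a typical core with two 256-bit load ports (check uops.info for your exact microarchitecture):

```csharp
using System;

double loadsPerCycle = 2.0;  // assumed: two 256-bit loads per clock
double fmasPerCycle  = 2.0;  // FMA throughput per clock
double loadsPerFma   = 2.0;  // each dot-product FMA reads x[i] and y[i]

// Even with enough accumulators to hide latency, sustained FMAs per clock
// is capped by whichever resource runs out first:
double sustained = Math.Min(fmasPerCycle, loadsPerCycle / loadsPerFma);
Console.WriteLine(sustained);  // 1 FMA (one 4-double vector) per clock
```

So the unrolled loop should approach one 256-bit FMA per clock from memory-resident data, regardless of whether the FP units could do two.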

See also Latency bounds and throughput bounds for processors for operations that must occur in sequence re: dependency chains and latency, and https://agner.org/optimize/ (microarchitecture and asm guides). Also https://uops.info/ for better instruction tables than Agner's.

Your current version takes advantage of the data parallelism in a dot product using SIMD: more work per instruction. But you're not doing anything to let the CPU find any instruction-level parallelism between those SIMD vector operations, so you're missing one of the three factors that can scale performance (SIMD parallelism, ILP, and thread-level parallelism for huge arrays).

answered Sep 10 '25 00:09 by Peter Cordes