
Why is this code using VMULPD to write registers that will be overwritten by VFMADD? Isn't that useless?

Tags: assembly, avx, fma

While reviewing this piece of code, I noticed the following four instructions:

vmulpd  %ymm1,%ymm3,%ymm4 /* aim*bim */
vmulpd  %ymm0,%ymm3,%ymm6 /* are*bim */
vfmadd231pd %ymm2,%ymm1,%ymm6
vfmsub231pd %ymm0,%ymm2,%ymm4

Now, if you consider that in AT&T syntax instructions take the form operator source,source,destination, aren't the first two instructions useless?

%ymm4 = f(%ymm1, %ymm3)
%ymm6 = f(%ymm0, %ymm3)
%ymm6 = f(%ymm2, %ymm1)
%ymm4 = f(%ymm0, %ymm2)

The first two values are clearly never read, so they shouldn't be calculated. However, it seems that this is not the case, since tests fail if I remove these lines.

Giulio Muscarello asked Mar 08 '23 09:03


1 Answer

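FMA is a 3-input instruction, computing a * b + c. The destination is a read-write operand (like with SSE2 mulpd %xmm0, %xmm1), so the vmulpd results in %ymm4 and %ymm6 are not dead: the FMA instructions that follow read them as inputs before overwriting them.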

FMADD/FMSUB/FNMADD/FNMSUB (and even FMADDSUB / FMSUBADD) instructions each come in 3 operand orders, giving you a choice of which of the 3 operands (a, b, or c) is the read-write destination operand and which one can be a memory operand. See the docs for vfmadd231pd / 132PD / 213PD to see which inputs are multiplied and which is the "accumulator" in your code. (I can never keep the numbering scheme straight in my head: this is one case where writing with intrinsics is much easier. But in AT&T syntax the destination is still always last.)
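To make the operand numbering concrete, here is a minimal scalar sketch in C of one lane of the four instructions from the question. The helper names (fmadd132, fmadd213, fmadd231, fmsub231, complex_mul_lane) and the cplx struct are hypothetical, and the register roles (ymm0 = are, ymm1 = aim, ymm3 = bim, and by inference ymm2 = bre) are assumptions read off the comments in the question's code. Plain a * b + c is used, so the single rounding of a real FMA is not modeled, only the data flow.

```c
/* Scalar, one-lane model of the FMA3 operand orders.
   Arguments are in Intel order: op1 is the destination, which is the LAST
   operand in AT&T syntax. Each helper returns the new value of op1.
   The digits name which operands are multiplied and which one is added. */
double fmadd132(double op1, double op2, double op3) { return op1 * op3 + op2; }
double fmadd213(double op1, double op2, double op3) { return op2 * op1 + op3; }
double fmadd231(double op1, double op2, double op3) { return op2 * op3 + op1; }
double fmsub231(double op1, double op2, double op3) { return op2 * op3 - op1; }

/* Hypothetical result type for one complex lane. */
typedef struct { double re, im; } cplx;

/* One lane of the question's sequence. The register roles are an
   assumption inferred from the comments in the original code. */
cplx complex_mul_lane(double are, double aim, double bre, double bim)
{
    double ymm0 = are, ymm1 = aim, ymm2 = bre, ymm3 = bim;

    double ymm4 = ymm3 * ymm1;          /* vmulpd %ymm1,%ymm3,%ymm4 : aim*bim */
    double ymm6 = ymm3 * ymm0;          /* vmulpd %ymm0,%ymm3,%ymm6 : are*bim */

    /* vfmadd231pd %ymm2,%ymm1,%ymm6 : ymm6 = ymm1*ymm2 + ymm6 */
    ymm6 = fmadd231(ymm6, ymm1, ymm2);  /* aim*bre + are*bim (imaginary part) */

    /* vfmsub231pd %ymm0,%ymm2,%ymm4 : ymm4 = ymm2*ymm0 - ymm4 */
    ymm4 = fmsub231(ymm4, ymm2, ymm0);  /* are*bre - aim*bim (real part) */

    cplx r = { ymm4, ymm6 };
    return r;
}
```

With a = 1 + 2i and b = 3 + 4i, complex_mul_lane(1, 2, 3, 4) returns re = -5 and im = 10, which is the complex product. Removing either vmulpd would leave the corresponding accumulator holding whatever was in the register before, which is consistent with the tests failing.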

Note that the Intel docs use Intel syntax, dst, src1, src2, .... Reverse the list of operands to get AT&T syntax, e.g. ..., src2, src1, dst. See the at&t-syntax tag wiki, and also the intel-syntax tag wiki.


BTW, there is an FMA4 ISA-extension, where FMA instructions have 3 inputs and a separate output. See https://en.wikipedia.org/wiki/FMA_instruction_set.

Intel was originally going to implement FMA4, but then changed to the current FMA3 (without telling AMD until as late as possible, for anti-competitive reasons: see Agner Fog's blog post Stop the instruction set war). For AMD Bulldozer, it was too late to change, so Bulldozer only supports FMA4. Piledriver supports FMA3 and FMA4. Ryzen before Zen 2 officially supports FMA3 and unofficially still executes FMA4. Intel CPUs have only ever supported FMA3.

Ryzen Zen1 / Zen+ chips do apparently decode and execute FMA4 instructions correctly, but don't report FMA4 support in CPUID. (There was one sketchy report of incorrect FMA4 results, but nobody else reproduced it. It seems to be just FUD rumours that circulated based on that one report, probably a software bug not a chip problem.)

However, Zen 2 does not support FMA4: executing an FMA4 instruction raises an illegal-instruction (#UD) exception.
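Since the distinction above is between what a CPU executes and what it reports, a rough sketch of reading the reported feature bits may help. It assumes a GCC or Clang toolchain with the <cpuid.h> helper header on x86; the function names cpu_reports_fma3 and cpu_reports_fma4 are hypothetical. It only reads the CPUID bits (FMA3 is leaf 1, ECX bit 12; FMA4 is leaf 0x80000001, ECX bit 16) and does not perform the additional OS-support check for AVX state that production code would need.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header; assumed to be available */
#endif

/* Returns 1 if CPUID advertises FMA3 (leaf 1, ECX bit 12), else 0.
   On non-x86 targets this sketch simply returns 0. */
int cpu_reports_fma3(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                 /* leaf not available */
    return (int)((ecx >> 12) & 1u);
#else
    return 0;
#endif
}

/* Returns 1 if CPUID advertises FMA4 (leaf 0x80000001, ECX bit 16), else 0.
   This is the reported bit only: per the text above, Zen 1 / Zen+ would
   return 0 here even though they apparently execute FMA4 instructions. */
int cpu_reports_fma4(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return 0;                 /* extended leaf not available */
    return (int)((ecx >> 16) & 1u);
#else
    return 0;
#endif
}
```

A dispatcher would normally trust these bits rather than probing by executing the instruction, so code following this pattern never takes an FMA4 path on any Ryzen, whether or not the chip could run it.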

Peter Cordes answered Apr 08 '23 22:04