Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is there no fused multiply-add for general-purpose registers on x86_64 CPUs?

On Intel and AMD x86_64 processors, SIMD vectorized registers have specific fused-multiply-add capabilities, but general-purpose (scalar, integer) registers don't - you basically need to multiply, then add (unless you can fit things into an lea).

Why is that? I mean, is it that useless so as to not be worth the overhead?

like image 332
einpoklum Avatar asked Mar 13 '18 10:03

einpoklum


People also ask

Why does x86 have so few registers?

Adding more registers would no longer be x86, so you mean 'creating a new instruction set based on x86 with more registers'. Most compilers would not use these as they can just as well compile to x86 to also target a superset of x86. More registers means more expensive hardware.

How many instructions does x86_64 have?

states that the current x86-64 design “contains 981 unique mnemonics and a total of 3,684 instruction variants” [2]. However they do not specify which features are included in their count.

How many registers does x86 have?

The x86 architecture has 8 General-Purpose Registers (GPR), 6 Segment Registers, 1 Flags Register and an Instruction Pointer. 64-bit x86 has additional registers.


1 Answers

Integer multiply is common, but not one of the most common things to do with integers. But with floating point numbers, multiplying and adding is used all the time, and FMA provides major speedups for lots of ALU-bound FP code.

Also, floating point actually avoids precision loss with an FMA (the x*y internal temporary isn't rounded off at all before adding). This is why the ISO C99 / C++ fma() math library function exists, and why it's slow to implement without hardware FMA support.

Integer FMA (or multiply-accumulate, aka MAC) doesn't have any precision benefit vs. separate multiply and add.


Some non-x86 ISAs do provide integer FMA. It's not useless, but Intel and AMD both haven't bothered to include it until AVX512-IFMA (and that's still only for SIMD, basically exposing the 52-bit mantissa multiplier circuits needed for double-precision FMA/vmulpd for use by integer instructions).

Non-x86 examples include:

  • MIPS32, madd / maddu (unsigned) to multiply-accumulate into the hi / lo registers (the special registers used as a destination by regular multiply and divide instructions).

  • ARM smlal and friends (32x32=>64 bit MAC, or 16x16=>32 bit), also available for unsigned integer. Operands are regular R0..R15 general purpose registers.


An integer register FMA would be useful on x86, but uops that have 3 integer inputs are rare. CMOV and ADC have 3 inputs, but one of those is flags. Even then, they didn't decode to a single uop on Intel until Broadwell, after 3-input uop support was added for FP FMA in Haswell.

Haswell and later can track fused-domain uops with 3 integer inputs, though, for (some) micro-fused instructions with indexed addressing modes. Sandybridge/Ivybridge un-laminate instructions like add eax, [rdx+rcx]. (But Nehalem could keep them micro-fused, like Haswell; SnB simplified the fused-domain uop format). Anyway, that's fused domain, not in the scheduler. Only Broadwell/Skylake can track 3-input integer uops in the scheduler, and that's only for 2 integer + flags, not 3 integer registers.

Intel does use a "unified" scheduler, where FP and integer ops use the same scheduler, and it can track proper 3-input FP FMA. So IDK if there's a technical obstacle. If not, IDK why Intel didn't include integer FMA as part of BMI2 or something, which added stuff like mulx (2-input 2-output mul with mostly explicit operands, unlike legacy mul that uses rdx:rax.)


SSE2/SSSE3 does have integer mul-add instructions for vector registers, but only horizontal add after widening 16x16 => 32-bit (SSE2 pmaddwd) or (unsigned)8x(signed)8=>16-bit (SSSE3 pmaddubsw).

But those are only 2-input instructions, so even though there's a multiply and an add, it's very different from FMA.


Footnote: The question title originally said there was no FMA "for scalars". There is scalar FP FMA with the same FMA3 extension that added the packed versions of these: VFMADD231SD and friends operate on scalar double-precision, and the same flavours of vfmaddXXXss are available for scalar float in XMM registers.

like image 103
Peter Cordes Avatar answered Sep 30 '22 07:09

Peter Cordes