That is, where an unfused multiply–add would compute the product b × c, round it to N significant bits, add the result to a, and round back to N significant bits, a fused multiply–add would compute the entire expression a + (b × c) to its full precision before rounding the final result down to N significant bits.
The primary benefit of FMA is that it can be up to twice as fast. Rather than spend one instruction on the multiply and a second one on the add, the FPU performs both operations with a single instruction.
The compiler is allowed to fuse a separated add and multiply, even though this changes the final result (by making it more accurate).
An FMA has only one rounding (it effectively keeps infinite precision for the internal temporary multiply result), while an ADD + MUL has two.
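The difference between one rounding and two can be made visible with a small sketch. It uses the standard std::fma for the fused case; the function names, the demo inputs, and the volatile temporary (used here only to stop the compiler from fusing the unfused version by itself) are assumptions of this illustration, not part of the original answer.

```cpp
#include <cmath>

// Unfused: the product is rounded to double before the add.
// The volatile temporary forces the intermediate rounding and keeps the
// compiler from contracting the two operations into an FMA on its own.
double unfused_mul_add(double a, double b, double c) {
    volatile double product = a * b;  // first rounding
    return product + c;               // second rounding
}

// Fused: std::fma computes a*b + c as if to infinite precision, rounding once.
double fused_mul_add(double a, double b, double c) {
    return std::fma(a, b, c);
}

// Hypothetical demo inputs: x = 1 + 2^-30, so x*x = 1 + 2^-29 + 2^-60 exactly.
// The 2^-60 term does not fit next to 1 in a double's 53-bit significand,
// so the separately rounded product loses it.
double demo_x() { return 1.0 + std::ldexp(1.0, -30); }
double demo_c() { return -(1.0 + std::ldexp(1.0, -29)); }
```

With these inputs, unfused_mul_add(demo_x(), demo_x(), demo_c()) returns 0, while fused_mul_add returns 2^-60, the term that the intermediate rounding discarded.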
The IEEE and C standards allow this when #pragma STDC FP_CONTRACT ON is in effect, and compilers are allowed to have it ON by default (but not all do). GCC contracts into FMA by default (with the default -std=gnu*, but not -std=c*, e.g. -std=c++14). For Clang, it's only enabled with -ffp-contract=fast. (With just the #pragma enabled, contraction happens only within a single expression like a+b*c, not across separate C++ statements.)
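To make the "single expression versus separate statements" distinction concrete, here is a minimal sketch with two hypothetical functions (the names are illustrative). Whether either one actually becomes an FMA depends on the compiler, its version, and the flags in use, so treat the comments as a description of the rule rather than a guarantee of generated code.

```cpp
// Multiply and add inside one expression: eligible for contraction under
// the per-expression rule (the FP_CONTRACT ON behaviour described above).
float single_expression(float a, float b, float c) {
    return a + b * c;
}

// Multiply and add in separate statements: the product is a named
// temporary, so only a more permissive mode such as -ffp-contract=fast
// may fuse it with the following add.
float separate_statements(float a, float b, float c) {
    float product = b * c;
    return a + product;
}
```

Both functions return the same value whenever the product is exactly representable; they can differ in the last bit only when the product needs rounding and just one of them is contracted.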
This is different from strict vs. relaxed floating point (or in gcc terms, -ffast-math vs. -fno-fast-math) that would allow other kinds of optimizations that could increase the rounding error depending on the input values. This one is special because of the infinite precision of the FMA internal temporary; if there was any rounding at all in the internal temporary, this wouldn't be allowed in strict FP.
Even if you enable relaxed floating-point, the compiler might still choose not to fuse since it might expect you to know what you're doing if you're already using intrinsics.
So the best way to make sure you actually get the FMA instructions you want is to use the provided intrinsics for them:
FMA3 Intrinsics: (AVX2 - Intel Haswell)
_mm_fmadd_pd(), _mm256_fmadd_pd()
_mm_fmadd_ps(), _mm256_fmadd_ps()
FMA4 Intrinsics: (XOP - AMD Bulldozer)
_mm_macc_pd(), _mm256_macc_pd()
_mm_macc_ps(), _mm256_macc_ps()
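As a rough illustration of the FMA3 intrinsics listed above, the following sketch fuses eight single-precision multiply-adds with _mm256_fmadd_ps. The function name fma8 is hypothetical, and the target attribute is a GCC/Clang extension assumed here so that the file builds without passing -mfma globally; with MSVC you would drop the attribute and compile with /arch:AVX2 instead.

```cpp
#include <immintrin.h>

// Computes out[i] = a[i] * b[i] + c[i] for 8 floats, one rounding per lane.
// The target attribute (GCC/Clang only) enables AVX2 and FMA code
// generation for just this function.
__attribute__((target("avx2,fma")))
void fma8(const float* a, const float* b, const float* c, float* out) {
    __m256 va = _mm256_loadu_ps(a);  // unaligned loads: no alignment required
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_loadu_ps(c);
    _mm256_storeu_ps(out, _mm256_fmadd_ps(va, vb, vc));  // va * vb + vc
}
```

Calling this on a CPU without FMA support raises an illegal-instruction fault, so a caller would normally check first, for example with __builtin_cpu_supports("fma") on GCC or Clang.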
I tested the following code in GCC 5.3, Clang 3.7, ICC 13.0.1 and MSVC 2015 (compiler version 19.00).
#include <immintrin.h>  // __m256, _mm256_add_ps, _mm256_mul_ps

float mul_add(float a, float b, float c) {
    return a*b + c;
}

__m256 mul_addv(__m256 a, __m256 b, __m256 c) {
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
}
With the right compiler options (see below), every compiler will generate a vfmadd instruction (e.g. vfmadd213ss) from mul_add. However, MSVC alone fails to contract mul_addv to a single vfmadd instruction (e.g. vfmadd213ps).
The following compiler options are sufficient to generate vfmadd instructions (except for mul_addv with MSVC).
GCC: -O2 -mavx2 -mfma
Clang: -O1 -mavx2 -mfma -ffp-contract=fast
ICC: -O1 -march=core-avx2
MSVC: /O1 /arch:AVX2 /fp:fast
GCC 4.9 will not contract mul_addv to a single FMA instruction, but since at least GCC 5.1 it does. I don't know when the other compilers started doing this.