How to use Fused Multiply-Add (FMA) instructions with SSE/AVX

How does fused multiply add work?

Where an unfused multiply–add would compute the product b × c, round it to N significant bits, add the result to a, and round back to N significant bits, a fused multiply–add computes the entire expression a + (b × c) to full precision and rounds only once, when producing the final N-bit result.

What is the advantage of a fused multiply add?

The primary benefit of FMA is that it can be up to twice as fast. Rather than issuing a multiply and then a separate add, the FPU performs both operations with a single instruction. A second benefit is accuracy, because the result is rounded only once.

The compiler is allowed to fuse a separated add and multiply, even though this changes the final result (by making it more accurate).

An FMA has only one rounding (it effectively keeps infinite precision for the internal temporary multiply result), while an ADD + MUL has two.

The IEEE and C standards allow this when #pragma STDC FP_CONTRACT ON is in effect, and compilers are allowed to have it ON by default (but not all do). GCC contracts into FMA by default (with the default -std=gnu*, but not with -std=c*, e.g. -std=c++14). For Clang, it's only enabled with -ffp-contract=fast. (With just the #pragma enabled, contraction happens only within a single expression like a+b*c, not across separate C++ statements.)
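To make the "single expression" versus "separate statements" distinction concrete, here is a small sketch. The function names are hypothetical, and some compilers may ignore the pragma (possibly with a warning), so treat it as an illustration of the rule rather than a guarantee of what any particular compiler emits.

```c
/* Ask for contraction within expressions. Assumption: the compiler
   honors this pragma; some ignore it and rely on -ffp-contract instead. */
#pragma STDC FP_CONTRACT ON

/* One expression: a*b + c may be contracted into a single FMA. */
float mul_add_expr(float a, float b, float c) {
    return a * b + c;
}

/* Two statements: the pragma alone does not permit fusing across the
   statement boundary; per the text above, -ffp-contract=fast does. */
float mul_add_stmts(float a, float b, float c) {
    float product = a * b;
    return product + c;
}
```

For inputs whose product is exactly representable, both functions return the same value whether or not contraction happens; the difference only shows up when the intermediate product would have been rounded.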

This is different from strict vs. relaxed floating point (or in GCC terms, -fno-fast-math vs. -ffast-math), where relaxed mode allows other kinds of optimizations that could increase the rounding error depending on the input values. Contraction is special because of the infinite precision of the FMA internal temporary; if there were any rounding at all in the internal temporary, this wouldn't be allowed in strict FP.

Even if you enable relaxed floating-point, the compiler might still choose not to fuse since it might expect you to know what you're doing if you're already using intrinsics.

So the best way to make sure you actually get the FMA instructions you want is to use the provided intrinsics for them:

FMA3 Intrinsics: (introduced with Intel Haswell, alongside AVX2)

_mm_fmadd_pd(), _mm256_fmadd_pd()
_mm_fmadd_ps(), _mm256_fmadd_ps()
and about a gazillion other variations...

FMA4 Intrinsics: (AMD Bulldozer, introduced alongside XOP)

_mm_macc_pd(), _mm256_macc_pd()
_mm_macc_ps(), _mm256_macc_ps()
and about a gazillion other variations...
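As a minimal sketch of calling one of the FMA3 intrinsics listed above: the example below computes out[i] = a[i]*b[i] + c[i] for eight floats with _mm256_fmadd_ps. The names fma8 and fma8_kernel are hypothetical. It assumes an x86 target and GCC or Clang, since it relies on the target attribute and __builtin_cpu_supports (both compiler extensions) so that the file builds without -mfma and skips the kernel on CPUs that lack FMA.

```c
#include <immintrin.h>

/* Kernel: eight single-precision multiply-adds in one FMA3 instruction.
   The target attribute (GCC/Clang extension) enables AVX2 and FMA for
   this function only, so no -mavx2 -mfma is needed on the command line. */
__attribute__((target("avx2,fma")))
static void fma8_kernel(const float *a, const float *b,
                        const float *c, float *out) {
    __m256 va = _mm256_loadu_ps(a);   /* unaligned loads of 8 floats */
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_loadu_ps(c);
    _mm256_storeu_ps(out, _mm256_fmadd_ps(va, vb, vc));  /* va*vb + vc */
}

/* Wrapper: returns 1 if the FMA kernel ran, 0 if the CPU lacks support
   (in which case out is left untouched). */
int fma8(const float *a, const float *b, const float *c, float *out) {
    if (!__builtin_cpu_supports("avx2") || !__builtin_cpu_supports("fma")) {
        return 0;
    }
    fma8_kernel(a, b, c, out);
    return 1;
}
```

If the whole translation unit is compiled with -mavx2 -mfma instead, the attribute and the runtime check are unnecessary and the intrinsic can be called directly, at the cost of the binary requiring an FMA-capable CPU.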

I tested the following code in GCC 5.3, Clang 3.7, ICC 13.0.1 and MSVC 2015 (compiler version 19.00).

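#include <immintrin.h>  // AVX intrinsics (__m256, _mm256_*)

// Scalar multiply-add written as a plain expression.
float mul_add(float a, float b, float c) {
    return a*b + c;
}

// Vector multiply-add written as separate multiply and add intrinsics.
__m256 mul_addv(__m256 a, __m256 b, __m256 c) {
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
}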

With the right compiler options (see below), every compiler will generate a vfmadd instruction (e.g. vfmadd213ss) from mul_add. MSVC is the only one that fails to contract mul_addv to a single vfmadd instruction (e.g. vfmadd213ps).

The following compiler options are sufficient to generate vfmadd instructions (except with mul_addv with MSVC).

GCC:   -O2 -mavx2 -mfma
Clang: -O1 -mavx2 -mfma -ffp-contract=fast
ICC:   -O1 -march=core-avx2
MSVC:  /O1 /arch:AVX2 /fp:fast

GCC 4.9 will not contract mul_addv to a single FMA instruction, but GCC has done so since at least version 5.1. I don't know when the other compilers started doing this.
