
Automatically generate FMA instructions in MSVC

MSVC has supported AVX/AVX2 instructions for years now, and according to this MSDN blog post, it can automatically generate fused multiply-add (FMA) instructions.

Yet neither of the following functions compiles to an FMA instruction:

#include <cmath>  // std::fma

float func1(float x, float y, float z)
{
    return x * y + z;
}

float func2(float x, float y, float z)
{
    return std::fma(x, y, z);
}

Even worse, std::fma is not implemented as a single FMA instruction and performs terribly, much slower than a plain x * y + z (poor performance is expected when the std::fma implementation does not rely on the FMA instruction).
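One reason a software std::fma is slow: the standard requires the result to be rounded only once, as if x * y + z were computed exactly, so without an FMA instruction the library has to emulate the extra precision. The sketch below illustrates this; the function name and the input values are chosen only for illustration. It picks a product whose lowest bit would be lost if the multiplication were rounded on its own.

```cpp
#include <cmath>

// Hypothetical helper: returns fma(x, x, z) for x = 1 + 2^-12 and z = -(1 + 2^-11).
// Exactly, x * x = 1 + 2^-11 + 2^-24, so the single-rounded result is 2^-24.
float fused_residual()
{
    const float x = 1.0f + std::ldexp(1.0f, -12);
    const float z = -(1.0f + std::ldexp(1.0f, -11));
    return std::fma(x, x, z);  // one rounding at the very end
}
```

A separately rounded x * x + z would typically give 0 for these inputs, assuming float evaluation and no contraction by the compiler, because the 2^-24 term does not fit into a float product.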

I compile with the /arch:AVX2 /O2 /Qvec flags. I also tried /fp:fast, with no success.

So the question is: how can MSVC be forced to automatically emit FMA instructions?

UPDATE

There is a #pragma fp_contract (on|off), which appears to do nothing.
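For reference, this is roughly how the pragma is applied. It is a minimal sketch: the function name is made up, the #ifdef guard is only there so that other compilers do not warn about an unknown pragma, and whether an FMA instruction is actually emitted still depends on the compiler version and flags.

```cpp
// fp_contract is an MSVC pragma; it only permits contraction, it does not force it.
#ifdef _MSC_VER
#pragma fp_contract(on)
#endif

// Hypothetical example function: with contraction allowed, the compiler
// may fuse the multiply and the add into one FMA instruction.
float mul_add_contract(float x, float y, float z)
{
    return x * y + z;
}
```

Inspecting the assembly listing (for example with /FA) is the only reliable way to see whether the pragma had any effect.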

plasmacel asked Dec 14 '15 11:12



2 Answers

I solved this long-standing problem.

As it turns out, the flags /fp:fast, /arch:AVX2 and /O1 (or higher) are not enough for Visual Studio 2015 to emit FMA instructions in 32-bit mode. You also need "Whole Program Optimization" turned on with the /GL flag.

Then Visual Studio 2015 will generate the FMA instruction vfmadd213ss for

float func1(float x, float y, float z)
{
    return x * y + z;
}

Regarding std::fma, I opened a bug at Microsoft Connect. They confirmed that std::fma does not compile to FMA instructions because the compiler does not treat it as an intrinsic. According to their response, this will be fixed in a future update to get the best possible codegen.
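Until std::fma is treated as an intrinsic, one possible workaround is to call the scalar FMA intrinsic directly. This is only a sketch: the function name and the USE_FMA_INTRINSIC macro are made up for illustration, and the preprocessor check assumes FMA is enabled at compile time (/arch:AVX2 on MSVC, -mfma on GCC/Clang), falling back to std::fma otherwise.

```cpp
#include <cmath>

// Assumption: these macros indicate that FMA code generation is enabled.
#if defined(__FMA__) || (defined(_MSC_VER) && defined(__AVX2__))
#include <immintrin.h>
#define USE_FMA_INTRINSIC 1
#else
#define USE_FMA_INTRINSIC 0
#endif

// Hypothetical replacement for std::fma on floats.
float fma_explicit(float x, float y, float z)
{
#if USE_FMA_INTRINSIC
    // _mm_fmadd_ss computes x * y + z in the lowest lane with a single rounding.
    return _mm_cvtss_f32(_mm_fmadd_ss(_mm_set_ss(x), _mm_set_ss(y), _mm_set_ss(z)));
#else
    return std::fma(x, y, z);  // portable fallback, may be a slow library call
#endif
}
```

The intrinsic path requires a CPU with FMA support at run time; the sketch does no run-time detection.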

plasmacel answered Oct 31 '22 20:10



MSVC 2015 does generate an fma instruction for scalar operations but not for vector operations (unless you explicitly use an fma intrinsic).

I compiled the following code

//foo.cpp
#include <immintrin.h>  // __m256, _mm256_add_ps, _mm256_mul_ps

float mul_add(float a, float b, float c) {
    return a*b + c;
}

//MSVC cannot handle vectors as function parameters so use const references
__m256 mul_addv(__m256 const &a, __m256 const &b, __m256 const &c) {
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
}

with

cl /c /O2 /arch:AVX2 /fp:fast /FA foo.cpp

in MSVC2015 and it produced the following assembly

;mul_add
vmovaps xmm3, xmm1
vfmadd213ss xmm3, xmm0, xmm2
vmovaps xmm0, xmm3

and

;mul_addv
vmovups ymm0, YMMWORD PTR [rcx]
vmulps  ymm1, ymm0, YMMWORD PTR [rdx]
vaddps  ymm0, ymm1, YMMWORD PTR [r8]
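For the vector case, the explicit intrinsic mentioned above would look roughly like this. It is a sketch under assumptions: mul_add8 and USE_AVX2_FMA are hypothetical names, the arrays are assumed to hold at least eight floats, and the preprocessor check assumes AVX2/FMA is enabled at compile time, with a plain loop as the fallback.

```cpp
// Assumption: these macros indicate that AVX2 and FMA code generation are enabled.
#if defined(__AVX2__) && (defined(__FMA__) || defined(_MSC_VER))
#include <immintrin.h>
#define USE_AVX2_FMA 1
#else
#define USE_AVX2_FMA 0
#endif

// Hypothetical helper: out[i] = a[i] * b[i] + c[i] for i in 0..7.
void mul_add8(const float *a, const float *b, const float *c, float *out)
{
#if USE_AVX2_FMA
    // One fused multiply-add over eight floats instead of vmulps + vaddps.
    __m256 r = _mm256_fmadd_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b), _mm256_loadu_ps(c));
    _mm256_storeu_ps(out, r);
#else
    for (int i = 0; i < 8; ++i)
        out[i] = a[i] * b[i] + c[i];  // scalar fallback without the intrinsic
#endif
}
```

The unaligned load and store are used so that the sketch makes no assumption about the alignment of the arrays.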

Z boson answered Oct 31 '22 18:10
