Why is it mandatory to use -ffast-math
with g++ to achieve the vectorization of loops using double
s? I don't like -ffast-math
because I don't want to lose precision.
-ffast-math tells the compiler to perform more aggressive floating-point optimizations. -ffast-math results in behavior that is not fully compliant with the ISO C or C++ standard. However, numerically robust floating-point programs are expected to behave correctly.
With the GCC compiler, the -ftree-vectorize option turns on auto-vectorization, and this flag is automatically set when using -O3 .
There are two ways to vectorize a loop computation in a C/C++ program. Programmers can use intrinsics inside the C/C++ source code to tell compilers to generate specific SIMD instructions so as to vectorize the loop computation. Or, compilers may be setup to vectorize the loop computation automatically.
Loop vectorization transforms procedural loops by assigning a processing unit to each pair of operands. Programs spend most of their time within such loops. Therefore, vectorization can significantly accelerate them, especially over large data sets.
You don’t necessarily lose precision with -ffast-math
. It only affects the handling of NaN
, Inf
etc. and the order in which operations are performed.
If you have a specific piece of code where you do not want GCC to reorder or simplify computations, you can mark variables as being used using an asm
statement.
For instance, the following code performs a rounding operation on f
. However, the two f += g
and f -= g
operations are likely to get optimised away by gcc:
static double moo(double f, double g)
{
g *= 4503599627370496.0; // 2 ** 52
f += g;
f -= g;
return f;
}
On x86_64, you can use this asm
statement to instruct GCC not to perform that optimisation:
static double moo(double f, double g)
{
g *= 4503599627370496.0; // 2 ** 52
f += g;
__asm__("" : "+x" (f));
f -= g;
return f;
}
You will need to adapt this for each architecture, unfortunately. On PowerPC, use +f
instead of +x
.
Very likely because vectorization means that you may have different results, or may mean that you miss floating point signals/exceptions.
If you're compiling for 32-bit x86 then gcc and g++ default to using the x87 for floating point math, on 64-bit they default to SSE, however the x87 can and will produce different values for the same computation so it's unlikely g++ will consider vectorizing if it can't guarantee that you will get the same results unless you use -ffast-math
or some of the flags it turns on.
Basically it comes down to the floating point environment for vectorized code may not be the same as the one for non vectorized code, sometimes in ways that are important, if the differences don't matter to you, something like
-fno-math-errno -fno-trapping-math -fno-signaling-nans -fno-rounding-math
but first look up those options and make sure that they won't affect your program's correctness. -ffinite-math-only
may help also
-ffast-math
enables operands reordering which allows many code to be vectorized.For example to calculate this
sum = a[0] + a[1] + a[2] + a[3] + a[4] + a[5] + … a[99]
the compiler is required to do the additions sequentially without -ffast-math
, because floating-point math is neither commutative nor associative.
That's the same reason why compilers can't optimize a*a*a*a*a*a
to (a*a*a)*(a*a*a)
without -ffast-math
That means no vectorization available unless you have very efficient horizontal vector adds.
However if -ffast-math
is enabled, the expression can be calculated like this (Look at A7. Auto-Vectorization
)
sum0 = a[0] + a[4] + a[ 8] + … a[96]
sum1 = a[1] + a[5] + a[ 9] + … a[97]
sum2 = a[2] + a[6] + a[10] + … a[98]
sum3 = a[3] + a[7] + a[11] + … a[99]
sum’ = sum0 + sum1 + sum2 + sum3
Now the compiler can vectorize it easily by adding each column in parallel and then do a horizontal add at the end
Does
sum’ == sum
? Only if(a[0]+a[4]+…) + (a[1]+a[5]+…) + (a[2]+a[6]+…) + ([a[3]+a[7]+…) == a[0] + a[1] + a[2] + …
This holds under associativity, which floats don’t adhere to, all of the time. Specifying/fp:fast
lets the compiler transform your code to run faster – up to 4 times faster, for this simple calculation.Do You Prefer Fast or Precise? - A7. Auto-Vectorization
It may be enabled by the -fassociative-math
flag in gcc
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With