I'm pretty new to intrinsics, and I've run into different behavior of my code with GCC 7.4 and GCC 8.3.
My code is pretty simple.
b.cpp:
#include <iostream>
#include <xmmintrin.h>
void foo(const float num, const float denom)
{
    const __v4sf num4 = {
        num,
        num,
        num,
        num,
    };
    const __v4sf denom4 = {
        denom,
        denom,
        denom,
        denom,
    };
    float res_arr[] = {0, 0, 0, 0};
    __v4sf *res = (__v4sf*)res_arr;
    *res = num4 / denom4;
    std::cout << res_arr[0] << std::endl;
    std::cout << res_arr[1] << std::endl;
    std::cout << res_arr[2] << std::endl;
    std::cout << res_arr[3] << std::endl;
}
In b.cpp we simply construct two __v4sf vectors from float variables and divide one by the other.
b.h:
#ifndef B_H
#define B_H
void foo(const float num, const float denom);
#endif
a.cpp:
#include "b.h"
int main (void)
{
const float denominator = 1.0f;
const float numerator = 12.0f;
foo(numerator, denominator);
return 0;
}
Here we just call our function from b.cpp
GCC 7.4 works ok:
g++-7 -c b.cpp -o b.o && g++-7 a.cpp b.o -o a.out && ./a.out
12
12
12
12
But something is wrong with GCC 8.3:
g++-8 -c b.cpp -o b.o && g++-8 a.cpp b.o -o a.out && ./a.out
inf
inf
inf
inf
So my question is: why do I get different results with different versions of GCC? Is it undefined behavior?
You've found a bug in gcc8 and later, which happens with/without optimization enabled. Thanks for reporting it.
With optimization enabled it's easy to see what the asm is doing because the __v4sf stuff optimizes away: it's just scalar division and printing the result 4 times. (Plus 4 calls to flush cout because you used std::endl for some reason.)
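As a rough sketch of what "just scalar division and printing the result 4 times" means at the source level, the function below is what foo() is effectively equivalent to. The names scalar_quotient and foo_scalar are made up for this illustration and are not from the question.

```cpp
#include <iostream>

// Hypothetical helper: every lane of num4 / denom4 holds the same quotient,
// so a single scalar division is enough.
float scalar_quotient(float num, float denom)
{
    return num / denom;
}

// Hypothetical scalar equivalent of foo().
void foo_scalar(float num, float denom)
{
    const float q = scalar_quotient(num, denom);
    for (int i = 0; i < 4; ++i)
        std::cout << q << '\n';  // '\n' ends the line without the flush that std::endl forces
}
```

Calling foo_scalar(12.0f, 1.0f) should print the same four lines that the GCC 7.4 build prints.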
gcc7 correctly optimizes it to divss xmm0, xmm1 to do num / denom. Then it converts to double, because the output functions only take double, not float, and passes that to the iostream functions. (GCC7 saves the double bit-pattern in integer register r14 instead of memory, with -mtune=skylake. GCC8 and later just use memory, which probably makes more sense.)
gcc8 and later does divss xmm0, .LC0[rip] where the constant from memory is 0 (the bit-pattern for +0.0). So it's dividing the num by zero, ignoring denom.
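To see why a divisor of +0.0 produces the inf output from the question, here is a minimal sketch. It assumes float follows IEEE 754 (checked with a static_assert), and the helper name divide_by_positive_zero is invented for this example.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: mimics what the miscompiled code computes, num / +0.0f.
float divide_by_positive_zero(float num)
{
    static_assert(std::numeric_limits<float>::is_iec559,
                  "this sketch assumes IEEE 754 floats");
    volatile float zero = 0.0f;  // volatile keeps the compiler from folding the division away
    return num / zero;           // positive finite num / +0.0 gives +infinity under IEEE 754
}
```

With num = 12.0f, std::isinf(divide_by_positive_zero(12.0f)) is true, which matches the four inf lines printed by the GCC 8.3 build.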
Check it out on the Godbolt compiler explorer.
Using alignas(16) float res_arr[4]; to remove the potential under-alignment of the __v4sf *res doesn't help. (You generally don't need __attribute__((aligned(16))) anymore; C++11 introduced standard syntax for alignment.)
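For reference, a sketch of the same computation with an alignas(16) buffer and the documented __m128 intrinsics from xmmintrin.h (_mm_set1_ps, _mm_div_ps, _mm_store_ps) instead of a pointer cast to __v4sf. It assumes an x86 target with SSE, the function names quotient_lane and is_16_byte_aligned are made up here, and nothing above says this form avoids the miscompilation, so treat it as an alternative spelling to test rather than a confirmed workaround.

```cpp
#include <cstdint>
#include <xmmintrin.h>

// Hypothetical helper: true if p sits on a 16-byte boundary.
bool is_16_byte_aligned(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical variant of foo(): returns lane `lane` (0..3) of num / denom.
float quotient_lane(float num, float denom, int lane)
{
    alignas(16) float res_arr[4] = {0, 0, 0, 0};  // C++11 alignment syntax
    const __m128 num4 = _mm_set1_ps(num);         // broadcast num to all 4 lanes
    const __m128 denom4 = _mm_set1_ps(denom);     // broadcast denom to all 4 lanes
    _mm_store_ps(res_arr, _mm_div_ps(num4, denom4));  // aligned store, needs 16-byte alignment
    return res_arr[lane];
}
```

If the buffer could not be guaranteed to be 16-byte aligned, _mm_storeu_ps would be the unaligned counterpart of the store used here.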