Different intrinsics behaviour depending on GCC version

Tags:

c++

gcc

undefined-behavior

intrinsics

I'm fairly new to intrinsics, and I'm seeing different behavior from my code with GCC 7.4 and GCC 8.3.

My code is pretty simple.

b.cpp:

#include <iostream>
#include <xmmintrin.h>

void foo(const float num, const float denom)
{
    const __v4sf num4 = {
        num,
        num,
        num,
        num,
    };
    const __v4sf denom4 = {
        denom,
        denom,
        denom,
        denom,
    };
    float res_arr[] = {0, 0, 0, 0};

    __v4sf *res = (__v4sf*)res_arr;
    *res = num4 / denom4;
    std::cout << res_arr[0] << std::endl;
    std::cout << res_arr[1] << std::endl;
    std::cout << res_arr[2] << std::endl;
    std::cout << res_arr[3] << std::endl;
}

In b.cpp we basically just construct two __v4sf vectors from float variables and perform a division.

b.h:

#ifndef B_H
#define B_H

void foo(const float num, const float denom);

#endif

a.cpp:

#include "b.h"

int main (void)
{
    const float denominator = 1.0f;
    const float numerator = 12.0f;
    foo(numerator, denominator);
    return 0;
}

Here we just call our function from b.cpp.

GCC 7.4 works OK:

g++-7 -c b.cpp -o b.o && g++-7 a.cpp b.o -o a.out && ./a.out
12
12
12
12

But something is wrong with GCC 8.3:

g++-8 -c b.cpp -o b.o && g++-8 a.cpp b.o -o a.out && ./a.out
inf
inf
inf
inf

So my question is: why do I get different results with different versions of GCC? Is it undefined behavior?

asked Jun 10 '19 08:06

Daiver

1 Answer

You've found a bug in gcc8 and later, which happens with/without optimization enabled. Thanks for reporting it.

With optimization enabled it's easy to see what the asm is doing because the __v4sf stuff optimizes away: it's just scalar division and printing the result 4 times. (Plus 4 calls to flush cout because you used std::endl for some reason.)
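As an aside on the flushing: a minimal sketch of printing the four lanes with '\n' instead of std::endl, so the stream is not flushed after every line. The helper names print_lanes and lanes_to_string are hypothetical, not from the question.

```cpp
#include <array>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper (not from the question): writes each lane on its own
// line using '\n', which does not force a flush the way std::endl does.
void print_lanes(std::ostream& os, const std::array<float, 4>& lanes)
{
    for (float lane : lanes) {
        os << lane << '\n';
    }
}

// Convenience wrapper that returns the formatted text as a string.
std::string lanes_to_string(const std::array<float, 4>& lanes)
{
    std::ostringstream oss;
    print_lanes(oss, lanes);
    return oss.str();
}
```

Calling print_lanes(std::cout, lanes) leaves it to the stream to decide when to flush; an explicit std::flush can still be added once at the end if that is needed.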

gcc7 correctly optimizes it to divss xmm0, xmm1 to do num / denom. Then it converts the result to double, because the iostream formatting code works on double rather than float, and passes that to the iostream functions. (GCC7 saves the double bit-pattern in integer register r14 instead of memory, with -mtune=skylake. GCC8 and later just use memory, which probably makes more sense.)

gcc8 and later do divss xmm0, .LC0[rip], where the constant from memory is 0 (the bit-pattern for +0.0). So it's dividing num by zero and ignoring denom.
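That matches the inf output: with IEEE 754 arithmetic, a nonzero finite float divided by +0.0 gives infinity. A minimal sketch of that scalar behavior, assuming an IEEE 754 float (checked by the static_assert); the function name divide_by_positive_zero is hypothetical, and volatile is only there to keep the compiler from folding the division at compile time.

```cpp
#include <cmath>
#include <limits>

// This sketch assumes IEEE 754 (IEC 559) floats, as on x86-64 with SSE.
static_assert(std::numeric_limits<float>::is_iec559,
              "sketch assumes IEEE 754 float arithmetic");

// Hypothetical helper: divides num by +0.0 at run time.
float divide_by_positive_zero(float num)
{
    volatile float zero = 0.0f;  // volatile: prevent compile-time folding
    return num / zero;
}
```

For a positive num such as 12.0f the result is +infinity, which iostream prints as inf.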

Check it out on the Godbolt compiler explorer.

Using alignas(16) float res_arr[4]; to remove the potential under-alignment of the __v4sf *res doesn't help. (You generally don't need __attribute__((aligned(16))) anymore; C++11 introduced standard syntax for alignment.)
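For comparison, here is a sketch of the same broadcast-and-divide written with the documented SSE intrinsics from xmmintrin.h and an alignas(16) buffer, instead of GNU vector syntax and a pointer cast. The function name divide_broadcast is hypothetical, and this is not claimed to change what gcc8 does with the original code; it only shows the alignment-safe store.

```cpp
#include <array>
#include <xmmintrin.h>

// Hypothetical helper (not from the question): broadcasts num and denom to
// all four lanes, divides lane-wise, and returns the four results.
std::array<float, 4> divide_broadcast(float num, float denom)
{
    const __m128 num4 = _mm_set1_ps(num);      // {num, num, num, num}
    const __m128 denom4 = _mm_set1_ps(denom);  // {denom, denom, denom, denom}

    // _mm_store_ps requires a 16-byte-aligned destination, hence alignas(16).
    alignas(16) float res_arr[4];
    _mm_store_ps(res_arr, _mm_div_ps(num4, denom4));

    return {{res_arr[0], res_arr[1], res_arr[2], res_arr[3]}};
}
```

If the destination's alignment cannot be guaranteed, _mm_storeu_ps is the unaligned counterpart of _mm_store_ps.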

answered Nov 01 '22 03:11

Peter Cordes
