Type punning with (float&)int works, (float const&)int converts like (float)int instead?

Tags:

VS2019, Release, x86.

template <int i> float get() const {
    int f = _mm_extract_ps(fmm, i);
    return (float const&)f;
}

When use return (float&)f; compiler uses

extractps m32, ...
movss xmm0, m32

.correct result

When use return (float const&)f; compiler uses

extractps eax, ...
movd xmm0, eax

.wrong result

The main idea that T& and T const& is at first T then const. Const is just some kind of agreement for programmers. You know that you can get around it. But there is NO any const in assembly code, but type float IS. And I think that for both float& and float const& it MUST be float representation (cpu register) in assembly. We can use intermediate int reg32, but the final interpretation must be float.

And at this time it looks like regression, because this worked fine before. And also using float& in this case is definitely strange, because we shouldn't case about float const& safety but temp var for float& is really questionable.

Microsoft answered:

Hi Truthfinder, thanks for the self-contained repro. As it happens, this behavior is actually correct. As my colleague @Xiang Fan [MSFT] described in an internal email:

The conversions performed by [a c-style cast] tries the following sequence: (4.1) — a const_cast (7.6.1.11), (4.2) — a static_cast (7.6.1.9), (4.3) — a static_cast followed by a const_cast, (4.4) — a reinterpret_cast (7.6.1.10), or (4.5) — a reinterpret_cast followed by a const_cast,

If a conversion can be interpreted in more than one of the ways listed above, the interpretation that appears first in the list is used.

So in your case, (const float &) is converted to static_cast, which has the effect "the initializer expression is implicitly converted to a prvalue of type “cv1 T1”. The temporary materialization conversion is applied and the reference is bound to the result."

But in the other case, (float &) is converted to reinterpret_cast because static_cast isn’t valid, which is the same as reinterpret_cast(&operand).

The actual "bug" you're observing is that one cast does: "transform the float-typed value "1.0" into the equivalent int-typed value "1"", while the other cast says "find the bit representation of 1.0 as a float, and then interpret those bits as an int".

For this reason we recommend against c-style casts.

Thanks!

MS forum link: https://developercommunity.visualstudio.com/content/problem/411552/extract-ps-intrinsics-bug.html

Any ideas?

P.S. What do I really want:

float val = _mm_extract_ps(xmm, 3);

In manual assembly I can write: extractps val, xmm0, 3 where val is float 32 memory variable. Only ONE! instruction. I want see the same result in compiler generated assembly code. No shuffles or any other excessive instructions. The most bad acceptable case is: extractps reg32, xmm0, 3; mov val, reg32.

My point about T& and T const&: The type of variable must be the SAME for both cases. But now float& will interpret m32 as float32 and float const& will interpret m32 as int32.

int main() {
    int z = 1;
    float x = (float&)z;
    float y = (float const&)z;
    printf("%f %f %i", x, y, x==y);
    return 0;
}

Out: 0.000000 1.000000 0

Is that really OK?

Best regards, Truthfinder

825

asked Jan 31 '19 06:01

truthfinder

1 Answers

There's an interesting question about C++ cast semantics (which Microsoft already briefly answered for you), but it's mixed up with your misuse of _mm_extract_ps resulting in needing a type-pun in the first place. (And only showing asm that is equivalent, omitting the int->float conversion.) If someone else wants to expand on the standard-ese in another answer, that would be great.

TL:DR: use this instead: it's zero or one shufps. No extractps, no type punning.

template <int i> float get(__m128 input) {
    __m128 tmp = input;
    if (i)     // constexpr i means this branch is compile-time-only
        tmp = _mm_shuffle_ps(tmp,tmp,i);  // shuffle it to the bottom.
    return _mm_cvtss_f32(tmp);
}

If you actually have a memory-destination use case, you should be looking at asm for a function that takes a float* output arg, not a function that needs the result in xmm0. (And yes, that is a use-case for the extractps instruction, but arguably not the _mm_extract_ps intrinsics. gcc and clang use extractps when optimizing *out = get<2>(in), although MSVC misses that and still uses shufps + movss.)

Both blocks of asm you show are simply copying the low 32 bits of xmm0 somewhere, with no conversion to int. You left out the important different, and only showed the part that just uselessly copies the float bit-pattern out of xmm0 and then back, in 2 different ways (to register or to memory). movd is a pure copy of the bits unmodified, just like the movss load.

It's the compiler's choice which to use, after you force it to use extractps at all. Going through a register and back is lower latency than store/reload, but more ALU uops.

The (float const&) attempt to type-pun does include a conversion from FP to integer, which you didn't show. As if we needed any more reason to avoid pointer/reference casting for type-punning, this really does mean something different: (float const&)f takes the integer bit-pattern (from _mm_extract_ps) as an int and converts that to float.

I put your code on the Godbolt compiler explorer to see what you left out.

float get1_with_extractps_const(__m128 fmm) {
    int f = _mm_extract_ps(fmm, 1);
    return (float const&)f;
}

;; from MSVC -O2 -Gv  (vectorcall passes __m128 in xmm0)
float get1_with_extractps_const(__m128) PROC   ; get1_with_extractps_const, COMDAT
    extractps eax, xmm0, 1   ; copy the bit-pattern to eax

    movd    xmm0, eax      ; these 2 insns are an alternative to pxor xmm0,xmm0 + cvtsi2ss xmm0,eax to avoid false deps and zero the upper elements
    cvtdq2ps xmm0, xmm0    ; packed conversion is 1 uop
    ret     0

GCC compiles it this way:

get1_with_extractps_const(float __vector(4)):    # gcc8.2 -O3 -msse4
        extractps       eax, xmm0, 1
        pxor    xmm0, xmm0            ; cvtsi2ss has an output dependency so gcc always does this
        cvtsi2ss        xmm0, eax     ; MSVC's way is probably better for float.
        ret

Apparently MSVC does define the behaviour of pointer/reference casting for type-punning. Plain ISO C++ doesn't (strict aliasing UB), and neither do other compilers. Use memcpy to type-pun, or a union (which GNU C and MSVC support in C++ as an extension). Of course in this case, type-punning the vector element you want to an integer and back is horrible.

Only for (float &)f does gcc warn about the strict-aliasing violation. And GCC / clang agree with MSVC that only this version is a type-pun, not materializing a float from an implicit conversion. C++ is weird!

float get1_with_extractps_nonconst(__m128 fmm) {
    int f = _mm_extract_ps(fmm, 1);
    return (float &)f;
}

<source>: In function 'float get_with_extractps_nonconst(__m128)':
<source>:21:21: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
     return (float &)f;
                     ^

gcc optimizes away the extractps altogether.

# gcc8.2 -O3 -msse4
get1_with_extractps_nonconst(float __vector(4)):
    shufps  xmm0, xmm0, 85    ; 0x55 = broadcast element 1 to all elements
    ret

Clang uses SSE3 movshdup to copy element 1 to 0. (And element 3 to 2). But MSVC doesn't, which is another reason to never use this:

float get1_with_extractps_nonconst(__m128) PROC
    extractps DWORD PTR f$[rsp], xmm0, 1     ; store
    movss   xmm0, DWORD PTR f$[rsp]          ; reload
    ret     0

Don't use `_mm_extract_ps` for this

Both of your versions are horrible because this is not what _mm_extract_ps or extractps are for. Intel SSE: Why does `_mm_extract_ps` return `int` instead of `float`?

A float in a register is the same thing as the low element of a vector. The high elements don't need to be zeroed. And if they did, you'd want to use insertps which can do xmm,xmm and zero elements according to an immediate.

Use _mm_shuffle_ps to bring the element you want to the low position of a register, and then it is a scalar float. (And you can tell a C++ compiler that with _mm_cvtss_f32). This should compile to just shufps xmm0,xmm0,2, without an extractps or any mov.

template <int i> float get() const {
    __m128 tmp = fmm;
    if (i)                               // i=0 means the element is already in place
        tmp = _mm_shuffle_ps(tmp,tmp,i);  // else shuffle it to the bottom.
    return _mm_cvtss_f32(tmp);
}

(I skipped using _MM_SHUFFLE(0,0,0,i) because that's equal to i.)

If your fmm was in memory, not a register, then hopefully compilers would optimize away the shuffle and just movss xmm0, [mem]. MSVC 19.14 does manage to do that, at least for the function-arg on the stack case. I didn't test other compilers, but clang should probably manage to optimize away the _mm_shuffle_ps; it's very good at seeing through shuffles.

Test-case proving this compiles efficiently

e.g. a test-case with a non-class-member version of your function, and a caller that inlines it for a specific i:

#include <immintrin.h>

template <int i> float get(__m128 input) {
    __m128 tmp = input;
    if (i)                  // i=0 means the element is already in place
        tmp = _mm_shuffle_ps(tmp,tmp,i);  // else shuffle it to the bottom.
    return _mm_cvtss_f32(tmp);
}

// MSVC -Gv (vectorcall) passes arg in xmm0
// With plain dumb x64 fastcall, arg is on the stack, and it *does* just MOVSS load without shuffling
float get2(__m128 in) {
    return get<2>(in);
}

From the Godbolt compiler explorer, asm output from MSVC, clang, and gcc:

;; MSVC -O2 -Gv
float get<2>(__m128) PROC               ; get<2>, COMDAT
        shufps  xmm0, xmm0, 2
        ret     0
float get<2>(__m128) ENDP               ; get<2>

;; MSVC -O2  (without Gv, so the vector comes from memory)
input$ = 8
float get<2>(__m128) PROC               ; get<2>, COMDAT
        movss   xmm0, DWORD PTR [rcx+8]
        ret     0
float get<2>(__m128) ENDP               ; get<2>

# gcc8.2 -O3 for x86-64 System V (arg in xmm0)
get2(float __vector(4)):
        shufps  xmm0, xmm0, 2   # with -msse4, we get unpckhps
        ret

# clang7.0 -O3 for x86-64 System V (arg in xmm0)
get2(float __vector(4)):
        unpckhpd        xmm0, xmm0      # xmm0 = xmm0[1,1]
        ret

clang's shuffle optimizer simplifies to unpckhpd, which is faster on some old CPUs. Unfortunately it didn't notice it could have used movhlps xmm0,xmm0, which is also fast and 1 byte shorter.

172

answered Oct 15 '22 18:10

Peter Cordes

Related questions
                            
                                How to run an executable on Heroku from node, works locally
                            
                                C++ structure initialization with all zeros
                            
                                Android Studio 2.2.2 LLDB 2.2 update issue
                            
                                initializing a vector class member in C++
                            
                                Memory efficient std::map alternative
                            
                                Is difference between two pointers legal c++17 constant expression?
                            
                                How to specify the compiler for CMAKE external project?
                            
                                Will the C++17 standard include "std::byte"?
                            
                                Correct way to allocate memory to std::shared_ptr
                            
                                Using enable_if with struct specialization
                            
                                Is it possible to set the value of elements in a constexpr array after declaration?
                            
                                POD members default initialization without braces
                            
                                How to create QList from std::vector
                            
                                Create an object without a name in C++
                            
                                What is a good design to add an "all" option to an enum in C++?
                            
                                Undefined reference to cv::imread(cv::String const&, int) [duplicate]
                            
                                Familiar template syntax for generic lambdas
                            
                                Different behavior when `std::lock_guard<std::mutex>` object has no name
                            
                                Why does this call to arrow (->) operator fail?
                            
                                How to add #define in Makefile?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Type punning with (float&)int works, (float const&)int converts like (float)int instead?

Tags:

c++

assembly

visual-c++

sse

intrinsics

truthfinder

People also ask

1 Answers

TL:DR: use this instead: it's zero or one shufps. No extractps, no type punning.

Don't use `_mm_extract_ps` for this

Test-case proving this compiles efficiently

Peter Cordes

Recent Activity

Donate For Us

Type punning with (float&)int works, (float const&)int converts like (float)int instead?

Tags:

c++

assembly

visual-c++

sse

intrinsics

truthfinder

People also ask

1 Answers

TL:DR: use this instead: it's zero or one shufps. No extractps, no type punning.

Don't use _mm_extract_ps for this

Test-case proving this compiles efficiently

Peter Cordes

Related questions

Recent Activity

Donate For Us

Don't use `_mm_extract_ps` for this