VS2019, Release, x86.
template <int i> float get() const {
int f = _mm_extract_ps(fmm, i);
return (float const&)f;
}
When use return (float&)f;
compiler uses
extractps m32, ...
movss xmm0, m32
.correct result
When use return (float const&)f;
compiler uses
extractps eax, ...
movd xmm0, eax
.wrong result
The main idea that T& and T const& is at first T then const. Const is just some kind of agreement for programmers. You know that you can get around it. But there is NO any const in assembly code, but type float IS. And I think that for both float& and float const& it MUST be float representation (cpu register) in assembly. We can use intermediate int reg32, but the final interpretation must be float.
And at this time it looks like regression, because this worked fine before. And also using float& in this case is definitely strange, because we shouldn't case about float const& safety but temp var for float& is really questionable.
Microsoft answered:
Hi Truthfinder, thanks for the self-contained repro. As it happens, this behavior is actually correct. As my colleague @Xiang Fan [MSFT] described in an internal email:
The conversions performed by [a c-style cast] tries the following sequence: (4.1) — a const_cast (7.6.1.11), (4.2) — a static_cast (7.6.1.9), (4.3) — a static_cast followed by a const_cast, (4.4) — a reinterpret_cast (7.6.1.10), or (4.5) — a reinterpret_cast followed by a const_cast,
If a conversion can be interpreted in more than one of the ways listed above, the interpretation that appears first in the list is used.
So in your case, (const float &) is converted to static_cast, which has the effect "the initializer expression is implicitly converted to a prvalue of type “cv1 T1”. The temporary materialization conversion is applied and the reference is bound to the result."
But in the other case, (float &) is converted to reinterpret_cast because static_cast isn’t valid, which is the same as reinterpret_cast(&operand).
The actual "bug" you're observing is that one cast does: "transform the float-typed value "1.0" into the equivalent int-typed value "1"", while the other cast says "find the bit representation of 1.0 as a float, and then interpret those bits as an int".
For this reason we recommend against c-style casts.
Thanks!
MS forum link: https://developercommunity.visualstudio.com/content/problem/411552/extract-ps-intrinsics-bug.html
Any ideas?
P.S. What do I really want:
float val = _mm_extract_ps(xmm, 3);
In manual assembly I can write: extractps val, xmm0, 3
where val is float 32 memory variable. Only ONE! instruction. I want see the same result in compiler generated assembly code. No shuffles or any other excessive instructions. The most bad acceptable case is: extractps reg32, xmm0, 3; mov val, reg32
.
My point about T& and T const&:
The type of variable must be the SAME for both cases. But now float&
will interpret m32 as float32 and float const&
will interpret m32 as int32.
int main() {
int z = 1;
float x = (float&)z;
float y = (float const&)z;
printf("%f %f %i", x, y, x==y);
return 0;
}
Out: 0.000000 1.000000 0
Is that really OK?
Best regards, Truthfinder
Type Punning In a systems language like C++ you often want to interpret a value of type A as a value of type B where A and B are completely unrelated types. This is called type punning.
The only safe manner of using type punning is with unsigned char or well unsigned char arrays (because we know that members of array objects are strictly contiguous and there is not any padding bytes when their size is computed with sizeof() ).
Most of the time, type punning won't cause any problems. It is considered undefined behavior by the C standard but will usually do the work you expect. That is unless you're trying to squeeze more performance out of your code through optimizations.
The strict aliasing rule dictates that pointers are assumed not to alias if they point to fundamentally different types, except for char* and void* which can alias to any other data type.
There's an interesting question about C++ cast semantics (which Microsoft already briefly answered for you), but it's mixed up with your misuse of _mm_extract_ps
resulting in needing a type-pun in the first place. (And only showing asm that is equivalent, omitting the int->float conversion.) If someone else wants to expand on the standard-ese in another answer, that would be great.
template <int i> float get(__m128 input) {
__m128 tmp = input;
if (i) // constexpr i means this branch is compile-time-only
tmp = _mm_shuffle_ps(tmp,tmp,i); // shuffle it to the bottom.
return _mm_cvtss_f32(tmp);
}
If you actually have a memory-destination use case, you should be looking at asm for a function that takes a float*
output arg, not a function that needs the result in xmm0
. (And yes, that is a use-case for the extractps
instruction, but arguably not the _mm_extract_ps
intrinsics. gcc and clang use extractps
when optimizing *out = get<2>(in)
, although MSVC misses that and still uses shufps + movss.)
Both blocks of asm you show are simply copying the low 32 bits of xmm0 somewhere, with no conversion to int. You left out the important different, and only showed the part that just uselessly copies the float
bit-pattern out of xmm0 and then back, in 2 different ways (to register or to memory). movd
is a pure copy of the bits unmodified, just like the movss load.
It's the compiler's choice which to use, after you force it to use extractps
at all. Going through a register and back is lower latency than store/reload, but more ALU uops.
The (float const&)
attempt to type-pun does include a conversion from FP to integer, which you didn't show. As if we needed any more reason to avoid pointer/reference casting for type-punning, this really does mean something different: (float const&)f takes the integer bit-pattern (from _mm_extract_ps
) as an int
and converts that to float
.
I put your code on the Godbolt compiler explorer to see what you left out.
float get1_with_extractps_const(__m128 fmm) {
int f = _mm_extract_ps(fmm, 1);
return (float const&)f;
}
;; from MSVC -O2 -Gv (vectorcall passes __m128 in xmm0)
float get1_with_extractps_const(__m128) PROC ; get1_with_extractps_const, COMDAT
extractps eax, xmm0, 1 ; copy the bit-pattern to eax
movd xmm0, eax ; these 2 insns are an alternative to pxor xmm0,xmm0 + cvtsi2ss xmm0,eax to avoid false deps and zero the upper elements
cvtdq2ps xmm0, xmm0 ; packed conversion is 1 uop
ret 0
GCC compiles it this way:
get1_with_extractps_const(float __vector(4)): # gcc8.2 -O3 -msse4
extractps eax, xmm0, 1
pxor xmm0, xmm0 ; cvtsi2ss has an output dependency so gcc always does this
cvtsi2ss xmm0, eax ; MSVC's way is probably better for float.
ret
Apparently MSVC does define the behaviour of pointer/reference casting for type-punning. Plain ISO C++ doesn't (strict aliasing UB), and neither do other compilers. Use memcpy
to type-pun, or a union (which GNU C and MSVC support in C++ as an extension). Of course in this case, type-punning the vector element you want to an integer and back is horrible.
Only for (float &)f
does gcc warn about the strict-aliasing violation. And GCC / clang agree with MSVC that only this version is a type-pun, not materializing a float
from an implicit conversion. C++ is weird!
float get1_with_extractps_nonconst(__m128 fmm) {
int f = _mm_extract_ps(fmm, 1);
return (float &)f;
}
<source>: In function 'float get_with_extractps_nonconst(__m128)':
<source>:21:21: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
return (float &)f;
^
gcc optimizes away the extractps
altogether.
# gcc8.2 -O3 -msse4
get1_with_extractps_nonconst(float __vector(4)):
shufps xmm0, xmm0, 85 ; 0x55 = broadcast element 1 to all elements
ret
Clang uses SSE3 movshdup
to copy element 1 to 0. (And element 3 to 2).
But MSVC doesn't, which is another reason to never use this:
float get1_with_extractps_nonconst(__m128) PROC
extractps DWORD PTR f$[rsp], xmm0, 1 ; store
movss xmm0, DWORD PTR f$[rsp] ; reload
ret 0
_mm_extract_ps
for thisBoth of your versions are horrible because this is not what _mm_extract_ps
or extractps
are for. Intel SSE: Why does `_mm_extract_ps` return `int` instead of `float`?
A float
in a register is the same thing as the low element of a vector. The high elements don't need to be zeroed. And if they did, you'd want to use insertps
which can do xmm,xmm and zero elements according to an immediate.
Use _mm_shuffle_ps
to bring the element you want to the low position of a register, and then it is a scalar float. (And you can tell a C++ compiler that with _mm_cvtss_f32
). This should compile to just shufps xmm0,xmm0,2
, without an extractps
or any mov
.
template <int i> float get() const {
__m128 tmp = fmm;
if (i) // i=0 means the element is already in place
tmp = _mm_shuffle_ps(tmp,tmp,i); // else shuffle it to the bottom.
return _mm_cvtss_f32(tmp);
}
(I skipped using _MM_SHUFFLE(0,0,0,i)
because that's equal to i
.)
If your fmm
was in memory, not a register, then hopefully compilers would optimize away the shuffle and just movss xmm0, [mem]
. MSVC 19.14 does manage to do that, at least for the function-arg on the stack case. I didn't test other compilers, but clang should probably manage to optimize away the _mm_shuffle_ps
; it's very good at seeing through shuffles.
e.g. a test-case with a non-class-member version of your function, and a caller that inlines it for a specific i
:
#include <immintrin.h>
template <int i> float get(__m128 input) {
__m128 tmp = input;
if (i) // i=0 means the element is already in place
tmp = _mm_shuffle_ps(tmp,tmp,i); // else shuffle it to the bottom.
return _mm_cvtss_f32(tmp);
}
// MSVC -Gv (vectorcall) passes arg in xmm0
// With plain dumb x64 fastcall, arg is on the stack, and it *does* just MOVSS load without shuffling
float get2(__m128 in) {
return get<2>(in);
}
From the Godbolt compiler explorer, asm output from MSVC, clang, and gcc:
;; MSVC -O2 -Gv
float get<2>(__m128) PROC ; get<2>, COMDAT
shufps xmm0, xmm0, 2
ret 0
float get<2>(__m128) ENDP ; get<2>
;; MSVC -O2 (without Gv, so the vector comes from memory)
input$ = 8
float get<2>(__m128) PROC ; get<2>, COMDAT
movss xmm0, DWORD PTR [rcx+8]
ret 0
float get<2>(__m128) ENDP ; get<2>
# gcc8.2 -O3 for x86-64 System V (arg in xmm0)
get2(float __vector(4)):
shufps xmm0, xmm0, 2 # with -msse4, we get unpckhps
ret
# clang7.0 -O3 for x86-64 System V (arg in xmm0)
get2(float __vector(4)):
unpckhpd xmm0, xmm0 # xmm0 = xmm0[1,1]
ret
clang's shuffle optimizer simplifies to unpckhpd
, which is faster on some old CPUs. Unfortunately it didn't notice it could have used movhlps xmm0,xmm0
, which is also fast and 1 byte shorter.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With