Consider these two functions using SSE:
#include <xmmintrin.h>

int ftrunc1(float f) {
    return _mm_cvttss_si32(_mm_set1_ps(f));
}

int ftrunc2(float f) {
    return _mm_cvttss_si32(_mm_set_ss(f));
}
Both behave exactly the same for any input, but the assembler output is different:
ftrunc1:
    pushl   %ebp
    movl    %esp, %ebp
    cvttss2si 8(%ebp), %eax
    leave
    ret
ftrunc2:
    pushl   %ebp
    movl    %esp, %ebp
    movss   8(%ebp), %xmm0
    cvttss2si %xmm0, %eax
    leave
    ret
That is, ftrunc2 uses one extra movss instruction!
Is this normal? Does it matter? Should _mm_set1_ps always be preferred over _mm_set_ss when you only need to set the bottom element?
The compiler used was GCC 4.5.2 with -O3 -msse.
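The claim that both variants return the same value can be spot-checked with a small harness. This is only a sketch: the sample values and the check_equivalence helper are my own additions, not part of the original code, and a handful of samples is a sanity check rather than a proof.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* The two variants from the question. */
static int ftrunc1(float f) { return _mm_cvttss_si32(_mm_set1_ps(f)); }
static int ftrunc2(float f) { return _mm_cvttss_si32(_mm_set_ss(f)); }

/* Hypothetical helper: asserts that both variants agree on a few samples,
   including values outside the range of int. */
static void check_equivalence(void) {
    static const float samples[] = {
        0.0f, -0.0f, 0.5f, -0.5f, 1.9f, -1.9f,
        123456.75f, -123456.75f, 3.0e9f, -3.0e9f
    };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        assert(ftrunc1(samples[i]) == ftrunc2(samples[i]));
    }
}
```

Calling check_equivalence() from main should pass silently. On a 32-bit x86 target this needs -msse, as in the question; on x86-64, SSE is normally enabled by default.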
_mm_set_ss maps directly to an assembly instruction (movss). But _mm_set1_ps does not.
From what I've seen on GCC, MSVC, and ICC:
SSE intrinsics that map one-to-one to an assembly instruction are generally treated "as-is", as a black box. So the compiler will only apply optimizations that work on the instruction as a whole. It will not attempt any optimizations that require dataflow/dependency analysis on the individual vector elements.
The _mm_set1_ps and _mm_set_ps intrinsics do not map to a single instruction and have special case handling by most compilers. From what I've seen, all three of the compilers I've listed above do attempt to perform dataflow analysis optimizations on the individual elements.
When you put it all together, the second example keeps the movss because the compiler doesn't realize that the top 3 elements don't matter. (It makes no attempt to "open up" the _mm_set_ss intrinsic.)
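To make the "top 3 elements don't matter" point concrete, here is a rough sketch that stores both vectors to memory and looks at each lane. The check_lanes helper and the value 2.75f are just my illustration; the intrinsics themselves are the standard ones from xmmintrin.h.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Hypothetical helper: inspects the lanes produced by each intrinsic. */
static void check_lanes(void) {
    float bcast[4], scalar[4];

    _mm_storeu_ps(bcast,  _mm_set1_ps(2.75f)); /* all four lanes hold f   */
    _mm_storeu_ps(scalar, _mm_set_ss(2.75f));  /* lane 0 holds f, rest 0  */

    /* The lowest lane is the same in both vectors... */
    assert(bcast[0] == 2.75f && scalar[0] == 2.75f);
    /* ...and only the upper three lanes differ. */
    assert(bcast[1] == 2.75f && bcast[2] == 2.75f && bcast[3] == 2.75f);
    assert(scalar[1] == 0.0f && scalar[2] == 0.0f && scalar[3] == 0.0f);

    /* cvttss2si reads lane 0 only, so the upper lanes never reach the result. */
    assert(_mm_cvttss_si32(_mm_set1_ps(2.75f)) == 2);
    assert(_mm_cvttss_si32(_mm_set_ss(2.75f)) == 2);
}
```

Since the conversion only consumes lane 0, a compiler that tracks individual elements can drop the work of filling the other lanes, which is consistent with the _mm_set1_ps version folding into a single cvttss2si with a memory operand.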