With the constraint that I can use only SSE and SSE2 instructions, I have a need to replace the least significant (0) element of a 4 element __m128i vector with the 0 element from another vector.
For floating point vectors, the task is simple - one can use the _mm_move_ss() intrinsic to cause the element to be replaced with the 0 element from another vector. It generates one movss instruction, so is quite efficient.
Using two casting intrinsics, it's possible to also convince the compiler to use a single SSE movss instruction to move integer data. The source code ends up looking like this:
__m128i NewVector = _mm_castps_si128(_mm_move_ss(_mm_castsi128_ps(Take3FromThisVector),
_mm_castsi128_ps(Take1FromThisVector)));
It looks a bit messy, but with a suitable amount of commenting it can be acceptable, especially since it generates a minimum of instructions. In its typical use everything's optimized to be in xmm registers.
My question is this:
Since it's a movss instruction, where the "ss" implies single precision floating point, is it okay to have it move integer data that could potentially contain some "special" or "illegal" (for floating point) combo of bits in any of the vector positions?
The obvious alternative - which I also implemented and tested - is to AND the first vector with a mask, then OR in a second vector that contains just one value in the least significant element, with all the others being zero. As you can imagine, this generates more instructions.
I've tested the casting approach I showed above and it doesn't seem to cause any problems, but I note in particular that there's no intrinsic provided that does this same operation for integer data. It seems as though Intel would have provided one if it was just as good for integer data - e.g., _mm_move_epi32 or similar. And so I'm skeptical whether this is a good idea.
I did some searches, e.g., "can a movss instruction cause a floating point exception", but did not find any information that would answer my question.
Thanks in advance for knowledge you're willing to share.
-Noel
Yes, it's fine to use FP shuffles like movss xmm, xmm
on integer data. The insn reference manual tells you that it can't raise FP numeric exceptions; only actual FP math instructions do that. So go ahead and cast.
There isn't even a bypass delay for using FP shuffles on integer data in most uarches (but there is extra latency for using integer shuffles between FP math instructions).
Agner Fog's "optimizing assembly" guide has a great section on what instructions are useful for different kinds of data movement (broadcasts, merging, etc.) See also the x86 tag wiki for more good links.
The reason there's no integer intrinsic is that the SSE2 movd
integer instruction zeros the upper bytes of the destination, like movss
used as a load, but unlike movss
between registers.
Intel's vector instruction set known for its inconsistency and non-orthogonality, esp. the earliest versions (like SSE1). SSE4.1 filled many gaps, but there are still obvious missing pieces.
The types __m128
and __m128i
are interchangeable. The main reason for the cast is to make your intentions clearer (and keep your compiler happy). The cast itself does not generate any extra assembly.
The _mm_move_ss
operation is described directly in terms of which bits end up in your result.
If you end up with an invalid bit combination for single-precision floats, this will only be a problem if you try to use the resulting value in floating-point calculations.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With