SSE intrinsics includes _mm_shuffle_ps xmm1 xmm2 immx
which allows one to pick 2 elements from xmm1
concatenated with 2 elements from xmm2
. However this is for floats, (implied by the _ps , packed single). However if you cast your packed integers __m128i, then you can use _mm_shuffle_ps as well:
#include <iostream>
#include <immintrin.h>
#include <sstream>
using namespace std;
template <typename T>
std::string __m128i_toString(const __m128i var) {
std::stringstream sstr;
const T* values = (const T*) &var;
if (sizeof(T) == 1) {
for (unsigned int i = 0; i < sizeof(__m128i); i++) {
sstr << (int) values[i] << " ";
}
} else {
for (unsigned int i = 0; i < sizeof(__m128i) / sizeof(T); i++) {
sstr << values[i] << " ";
}
}
return sstr.str();
}
int main(){
cout << "Starting SSE test" << endl;
cout << "integer shuffle" << endl;
int A[] = {1, -2147483648, 3, 5};
int B[] = {4, 6, 7, 8};
__m128i pC;
__m128i* pA = (__m128i*) A;
__m128i* pB = (__m128i*) B;
*pA = (__m128i)_mm_shuffle_ps((__m128)*pA, (__m128)*pB, _MM_SHUFFLE(3, 2, 1 ,0));
pC = _mm_add_epi32(*pA,*pB);
cout << "A[0] = " << A[0] << endl;
cout << "A[1] = " << A[1] << endl;
cout << "A[2] = " << A[2] << endl;
cout << "A[3] = " << A[3] << endl;
cout << "B[0] = " << B[0] << endl;
cout << "B[1] = " << B[1] << endl;
cout << "B[2] = " << B[2] << endl;
cout << "B[3] = " << B[3] << endl;
cout << "pA = " << __m128i_toString<int>(*pA) << endl;
cout << "pC = " << __m128i_toString<int>(pC) << endl;
}
Snippet of relevant corresponding assembly (mac osx, macports gcc 4.8, -march=native on an ivybridge CPU):
vshufps $228, 16(%rsp), %xmm1, %xmm0
vpaddd 16(%rsp), %xmm0, %xmm2
vmovdqa %xmm0, 32(%rsp)
vmovaps %xmm0, (%rsp)
vmovdqa %xmm2, 16(%rsp)
call __ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
....
Thus it seemingly works fine on integers, which I expected as the registers are agnostic to types, however there must be a reason why the docs say that this instruction is only for floats. Does someone know any downsides, or implications I have missed?
There is no equivalent to _mm_shuffle_ps
for integers. To achieve the same effect in this case you can do
SSE2
*pA = _mm_shuffle_epi32(_mm_unpacklo_epi32(*pA, _mm_shuffle_epi32(*pB, 0xe)),0xd8);
SSE4.1
*pA = _mm_blend_epi16(*pA, *pB, 0xf0);
or change to the floating point domain like this
*pA = _mm_castps_si128(
_mm_shuffle_ps(_mm_castsi128_ps(*pA),
_mm_castsi128_ps(*pB), _MM_SHUFFLE(3, 2, 1 ,0)));
But changing domains may incur bypass latency delays on some CPUs. Keep in mind that according to Agner
The bypass delay is important in long dependency chains where latency is a bottleneck, but not where it is throughput rather than latency that matters.
You have to test your code and see which method above is more efficient.
Fortunately, on most Intel/AMD CPUs, there is usually no penalty for using shufps
between most integer-vector instructions. Agner says:
For example, I found no delay when mixing
PADDD
andSHUFPS
[on Sandybridge].
Nehalem does have 2 bypass-delay latency to/from SHUFPS
, but even then a single SHUFPS
is often still faster than multiple other instructions. Extra instructions have latency, too, as well as costing throughput.
The reverse (integer shuffles between FP math instructions) is not as safe:
In Agner Fog's microarchitecture on page 112 in Example 8.3a, he shows that using PSHUFD
(_mm_shuffle_epi32
) instead of SHUFPS
(_mm_shuffle_ps
) when in the floating point domain causes a bypass delay of four clock cycles. In Example 8.3b he uses SHUFPS to remove the delay (which works in his example).
On Nehalem there are actually five domains. Nahalem seems to be the most effected (the bypass delays did not exist before Nahalem). On Sandy Bridge the delays are less significant. This is even more true on Haswell. In fact on Haswell Agner said he found no delays between SHUFPS
or PSHUFD
(see page 140).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With