Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

implications of using _mm_shuffle_ps on integer vector

Tags:

avx

sse

SSE intrinsics includes _mm_shuffle_ps xmm1 xmm2 immx which allows one to pick 2 elements from xmm1 concatenated with 2 elements from xmm2. However this is for floats, (implied by the _ps , packed single). However if you cast your packed integers __m128i, then you can use _mm_shuffle_ps as well:

#include <iostream>
#include <immintrin.h>
#include <sstream>

using namespace std;

template <typename T>
std::string __m128i_toString(const __m128i var) {
    std::stringstream sstr;
    const T* values = (const T*) &var;
    if (sizeof(T) == 1) {
        for (unsigned int i = 0; i < sizeof(__m128i); i++) {
            sstr << (int) values[i] << " ";
        }
    } else {
        for (unsigned int i = 0; i < sizeof(__m128i) / sizeof(T); i++) {
            sstr << values[i] << " ";
        }
    }
    return sstr.str();
}



int main(){

  cout << "Starting SSE test" << endl;
  cout << "integer shuffle" << endl;

 int A[] = {1,  -2147483648, 3, 5};
 int B[] = {4, 6, 7, 8};

  __m128i pC;

  __m128i* pA = (__m128i*) A;
  __m128i* pB = (__m128i*) B;

  *pA = (__m128i)_mm_shuffle_ps((__m128)*pA, (__m128)*pB, _MM_SHUFFLE(3, 2, 1 ,0));
  pC = _mm_add_epi32(*pA,*pB);

  cout << "A[0] = " << A[0] << endl;
  cout << "A[1] = " << A[1] << endl;
  cout << "A[2] = " << A[2] << endl;
  cout << "A[3] = " << A[3] << endl;

  cout << "B[0] = " << B[0] << endl;
  cout << "B[1] = " << B[1] << endl;
  cout << "B[2] = " << B[2] << endl;
  cout << "B[3] = " << B[3] << endl;

  cout << "pA = " << __m128i_toString<int>(*pA) << endl;
  cout << "pC = " << __m128i_toString<int>(pC) << endl;
}

Snippet of relevant corresponding assembly (mac osx, macports gcc 4.8, -march=native on an ivybridge CPU):

vshufps $228, 16(%rsp), %xmm1, %xmm0
vpaddd  16(%rsp), %xmm0, %xmm2
vmovdqa %xmm0, 32(%rsp)
vmovaps %xmm0, (%rsp)
vmovdqa %xmm2, 16(%rsp)
call    __ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
....

Thus it seemingly works fine on integers, which I expected as the registers are agnostic to types, however there must be a reason why the docs say that this instruction is only for floats. Does someone know any downsides, or implications I have missed?

like image 429
hbogert Avatar asked Nov 17 '14 22:11

hbogert


1 Answers

There is no equivalent to _mm_shuffle_ps for integers. To achieve the same effect in this case you can do

SSE2

*pA = _mm_shuffle_epi32(_mm_unpacklo_epi32(*pA, _mm_shuffle_epi32(*pB, 0xe)),0xd8);

SSE4.1

*pA = _mm_blend_epi16(*pA, *pB, 0xf0);

or change to the floating point domain like this

*pA = _mm_castps_si128( 
        _mm_shuffle_ps(_mm_castsi128_ps(*pA), 
                       _mm_castsi128_ps(*pB), _MM_SHUFFLE(3, 2, 1 ,0)));

But changing domains may incur bypass latency delays on some CPUs. Keep in mind that according to Agner

The bypass delay is important in long dependency chains where latency is a bottleneck, but not where it is throughput rather than latency that matters.

You have to test your code and see which method above is more efficient.

Fortunately, on most Intel/AMD CPUs, there is usually no penalty for using shufps between most integer-vector instructions. Agner says:

For example, I found no delay when mixing PADDD and SHUFPS [on Sandybridge].

Nehalem does have 2 bypass-delay latency to/from SHUFPS, but even then a single SHUFPS is often still faster than multiple other instructions. Extra instructions have latency, too, as well as costing throughput.


The reverse (integer shuffles between FP math instructions) is not as safe:

In Agner Fog's microarchitecture on page 112 in Example 8.3a, he shows that using PSHUFD (_mm_shuffle_epi32) instead of SHUFPS (_mm_shuffle_ps) when in the floating point domain causes a bypass delay of four clock cycles. In Example 8.3b he uses SHUFPS to remove the delay (which works in his example).

On Nehalem there are actually five domains. Nahalem seems to be the most effected (the bypass delays did not exist before Nahalem). On Sandy Bridge the delays are less significant. This is even more true on Haswell. In fact on Haswell Agner said he found no delays between SHUFPS or PSHUFD (see page 140).

like image 188
Z boson Avatar answered Oct 11 '22 10:10

Z boson