Best way to shuffle 64-bit portions of two __m128i's

Question

I have two __m128i values, a and b, that I want to shuffle so that the upper 64 bits of a land in the lower 64 bits of dst and the lower 64 bits of b land in the upper 64 bits of dst, i.e.

dst[ 0:63]  = a[64:127]
dst[64:127] = b[0:63]

Equivalent to:

__m128i dst = _mm_unpacklo_epi64(_mm_srli_si128(a, 8), b);

or

__m128i dst = _mm_castpd_si128(_mm_shuffle_pd(_mm_castsi128_pd(a), _mm_castsi128_pd(b), 1));

Is there a better way to do this than the first method? The second one is just one instruction, but I expect the switch to the floating-point SIMD execution domain to cost more than the extra instruction in the first.
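To make the two formulations concrete, here is a minimal sketch that wraps each one in a function and checks that both produce {a_hi, b_lo}. The function names (hi_lo_int, hi_lo_pd, hi_lo_versions_agree) are made up for illustration and are not part of any library; only SSE2 is assumed.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Integer-domain version: shift a right by 8 bytes so its high qword
   becomes the low qword, then interleave the low qwords of that and b. */
static __m128i hi_lo_int(__m128i a, __m128i b) {
    return _mm_unpacklo_epi64(_mm_srli_si128(a, 8), b);
}

/* FP-domain version: one shufpd. Immediate bit 0 = 1 picks a's high qword
   for the low half; bit 1 = 0 picks b's low qword for the high half. */
static __m128i hi_lo_pd(__m128i a, __m128i b) {
    return _mm_castpd_si128(
        _mm_shuffle_pd(_mm_castsi128_pd(a), _mm_castsi128_pd(b), 1));
}

/* Returns 1 if both versions yield {a_hi, b_lo} for the given inputs. */
static int hi_lo_versions_agree(uint64_t a_lo, uint64_t a_hi,
                                uint64_t b_lo, uint64_t b_hi) {
    /* _mm_set_epi64x takes the high qword first, then the low qword. */
    __m128i a = _mm_set_epi64x((long long)a_hi, (long long)a_lo);
    __m128i b = _mm_set_epi64x((long long)b_hi, (long long)b_lo);
    uint64_t r1[2], r2[2];
    _mm_storeu_si128((__m128i *)r1, hi_lo_int(a, b));
    _mm_storeu_si128((__m128i *)r2, hi_lo_pd(a, b));
    return r1[0] == a_hi && r1[1] == b_lo &&
           r2[0] == a_hi && r2[1] == b_lo;
}
```

The casts between __m128i and __m128d generate no instructions; they only change the type the compiler sees.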

Peter Cordes · Accepted Answer

Extra latency isn't always a problem. If the shuffle isn't part of a loop-carried dependency chain, just use the single instruction.

Also, there might not be any extra latency at all! Agner Fog's microarch doc says he found no extra latency in some cases when using the "wrong" type of shuffle or boolean on Sandybridge. Blends still have the extra latency. On Haswell, he says there are no extra delays at all for mixing types of shuffle. (pg 140, Data Bypass Delays.)

So go ahead and use shufps (or the shufpd from the question), unless you care a lot about your code being fast on Nehalem. (Earlier designs, Merom/Conroe and Penryn, didn't have extra bypass delays for using the wrong move or shuffle.)
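For reference, the shuffle from the question can also be written with shufps instead of shufpd. This is a sketch, not necessarily what the asker should ship: hi_lo_ps and hi_lo_ps_qword are hypothetical names used only here. The selector _MM_SHUFFLE(1, 0, 3, 2) takes 32-bit elements 2 and 3 of a (its high qword) for the low half and elements 0 and 1 of b (its low qword) for the high half.

```c
#include <emmintrin.h>  /* SSE2; also pulls in the SSE header with _MM_SHUFFLE */
#include <stdint.h>

/* Single shufps: result = { a[2], a[3], b[0], b[1] } as 32-bit elements,
   i.e. { a_hi, b_lo } as 64-bit elements. */
static __m128i hi_lo_ps(__m128i a, __m128i b) {
    return _mm_castps_si128(
        _mm_shuffle_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b),
                       _MM_SHUFFLE(1, 0, 3, 2)));
}

/* Helper for checking: builds a and b from qwords, applies hi_lo_ps,
   and returns qword idx (0 = low, 1 = high) of the result. */
static uint64_t hi_lo_ps_qword(uint64_t a_lo, uint64_t a_hi,
                               uint64_t b_lo, uint64_t b_hi, int idx) {
    __m128i a = _mm_set_epi64x((long long)a_hi, (long long)a_lo);
    __m128i b = _mm_set_epi64x((long long)b_hi, (long long)b_lo);
    uint64_t r[2];
    _mm_storeu_si128((__m128i *)r, hi_lo_ps(a, b));
    return r[idx & 1];
}
```

shufps has the shorter machine encoding of the two, since shufpd needs an extra 0x66 prefix byte, which is one reason to prefer it when both express the same shuffle.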

For AMD, shufps runs in the ivec domain, same as integer shuffles, so it's fine to use it. Like Intel, FP blends run in the FP domain, and thus have no bypass delay for FP data.

If you include multiple asm versions depending on which instruction sets are supported (without going completely nuts about having the optimal version of everything for every CPU, like x264 does), you might use wrong-type ops in your version for AVX CPUs, but use multiple instructions in your non-AVX version. Nehalem has large penalties (2-cycle bypass delays for each domain transition), while Sandybridge is 0 or 1 cycle. SnB is the first generation with AVX.
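A rough way to express that split with intrinsics is compile-time selection. The sketch below assumes the compiler predefines __AVX__ when AVX code generation is enabled (GCC and Clang do with -mavx, MSVC with /arch:AVX); the answer itself talks about separate asm versions, so this is an adaptation rather than the exact approach described. The names hi_lo_select and hi_lo_select_qword are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

static __m128i hi_lo_select(__m128i a, __m128i b) {
#if defined(__AVX__)
    /* AVX build: targets Sandybridge or later, where the bypass delay is
       0 or 1 cycle, so use the single "wrong-type" FP shuffle. */
    return _mm_castps_si128(
        _mm_shuffle_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b),
                       _MM_SHUFFLE(1, 0, 3, 2)));
#else
    /* Non-AVX build: may run on Nehalem, so stay in the integer domain
       at the cost of one extra instruction. */
    return _mm_unpacklo_epi64(_mm_srli_si128(a, 8), b);
#endif
}

/* Helper for checking: returns qword idx (0 = low, 1 = high) of the result. */
static uint64_t hi_lo_select_qword(uint64_t a_lo, uint64_t a_hi,
                                   uint64_t b_lo, uint64_t b_hi, int idx) {
    __m128i a = _mm_set_epi64x((long long)a_hi, (long long)a_lo);
    __m128i b = _mm_set_epi64x((long long)b_hi, (long long)b_lo);
    uint64_t r[2];
    _mm_storeu_si128((__m128i *)r, hi_lo_select(a, b));
    return r[idx & 1];
}
```

Both branches compute the same value, so callers see no difference apart from which execution domain the shuffle runs in.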

Pre-Nehalem (no SSE4.2) is so old that it's probably not worth tuning a version specifically for it, even though it doesn't have any penalties for "wrong type" shuffles. Nehalem is right on the cusp of being kinda slow, so software running on those systems will have the hardest time operating in real-time, or not feeling slow. Thus, being bad on Nehalem would add to a bad user experience since their system is already not the fastest.

Tags: intel, simd, sse, intrinsics

Steve Cox