
How do the shuffle/permute intrinsics work for 256 bit pd?

I'm trying to wrap my head around how the _mm256_shuffle_pd and _mm256_permute_pd intrinsics work. I can't predict what the result of either of these operations will be.

First, for _mm_shuffle_ps all is good: the results I get are the ones I expect. For example:

alignas(16) float b[4] = { 1.12, 2.22, 3.33, 4.44 }; // _mm_load_ps needs 16-byte alignment

__m128 a = _mm_load_ps(&b[0]);
a = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 0, 1, 2));
_mm_store_ps(&b[0], a);
// 3.33 2.22 1.12 4.44

So everything is right here. Now I wanted to try this with __m256d, which is what I'm currently using in my code. From what I've found, the _mm256_shuffle_ps/pd intrinsics work differently.

My understanding here is that the control mask is applied twice: once to the low 128 bits and once to the high 128 bits. The two lowest pairs of control bits select from the first vector ( storing the values in the first&second and the fifth&sixth element of the result ), while the two highest pairs select from the second vector. For example:

alignas(32) float b[8] = { 1.12, 2.22, 3.33, 4.44, 5.55, 6.66, 7.77, 8.88 }; // _mm256_load_ps needs 32-byte alignment

__m256 a = _mm256_load_ps(&b[0]);
a = _mm256_shuffle_ps(a, a, 0b00000111);
_mm256_store_ps(&b[0], a);
// 4.44 2.22 1.12 1.12 8.88 6.66 5.55 5.55

Here the result I expect ( and I actually get ) is { 4.44, 2.22, 1.12, 1.12, 8.88, 6.66, 5.55, 5.55 }

This should work as follows:

( diagram: each 2-bit selector picks one element from the low 128-bit lane, and the same selection is repeated in the high lane )

( Sorry, I'm bad at drawing. ) The same is then done for the second vector ( in this case a again ) using the highest two pairs ( so 00 00 ), filling the missing spaces.

I thought that _mm256_shuffle_pd would work the same way. So if I wanted the first double, I would have to select the 00 slot and the 01 slot to construct it correctly.

For example:

alignas(32) double b[4] = { 1.12, 2.22, 3.33, 4.44 }; // the double array implied by the output below

__m256d a = _mm256_load_pd(&b[0]);
a = _mm256_shuffle_pd(a, a, 0b01000100);
_mm256_store_pd(&b[0], a);
// 1.12 1.12 4.44 3.33

I would have expected this to output { 1.12, 1.12, 3.33, 3.33 }. In my head, I'm taking 00 01 ( 1.12 ) and 00 01 ( 3.33 ) from the first vector, and the same from the second, it being the same vector and all.

I've tried many combinations for the control mask and I just can't wrap my head around how this is used nor was I able to find somewhere where it was explained in a way I would understand.

So my question is: How does _mm256_shuffle_pd work? And how would I get the same result as _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 0, 2, 1)) with four doubles and a shuffle ( if at all possible)?

lds asked Aug 08 '18

1 Answer

shufps needs all 8 bits of its immediate just for 4 elements with 4 possible sources each. So it has no room to grow for 256-bit, and the only option was to replicate the same shuffle in both lanes.

But 128-bit shufpd only has 2 elements with 2 possible sources each, thus 2 x 1 bit. So the AVX version uses 4 bits total, 2 per lane, and the lanes get independent selector bits. (It's not lane-crossing, though, so it's not as powerful as 128-bit shufps.)


http://felixcloutier.com/x86/SHUFPD.html has full docs with a diagram, and detailed pseudocode. Intel's intrinsics guide for _mm256_shuffle_pd has the same pseudo-code.

AVX2 http://felixcloutier.com/x86/VPERMPD.html (_mm256_permute4x64_pd; note that _mm256_permute_pd maps to the in-lane vpermilpd, a different instruction) is lane-crossing, and uses its immediate exactly the way 128-bit shufps does: four 2-bit selectors.


The only lane-crossing 2-source shuffle is vperm2f128 (_mm256_permute2f128_pd), until AVX512F introduces the finer-granularity vpermt2pd and vpermt2ps (and the equivalent integer shuffles).

AVX1 doesn't have any lane-crossing shuffles with granularity smaller than 128-bit, not even 1-source versions. If you need one, you have to build it out of vinsertf128 or vperm2f128 + in-lane shuffles.


Thus, keeping 3D vectors in SIMD vectors is even worse with AVX than it is for float with 128-bit vectors. http://fastcpp.blogspot.com/2011/04/vector-cross-product-using-sse-code.html might be faster than scalar, but it's much worse than you can do if you design your data layout for SIMD.

Use separate arrays of contiguous x[], y[], and z[] so you can do 4x cross products in parallel with no shuffling, and take advantage of FMA instructions. Use SIMD to do multiple vectors in parallel, not to speed up single vectors.

See links in https://stackoverflow.com/tags/sse/info, especially https://deplinenoise.wordpress.com/2015/03/06/slides-simd-at-insomniac-games-gdc-2015/ which explains the data-layout issue quite well, and which level of a loop to vectorize with SIMD.

Peter Cordes answered Nov 07 '22