I'm trying to find a more efficient way to "rotate" or shift the 32-bit floating point values within an AVX __m256 vector to the right or left by one place.
Such that:
a7, a6, a5, a4, a3, a2, a1, a0
becomes
0, a7, a6, a5, a4, a3, a2, a1
(I don't mind if the data gets lost, as I replace the cell anyway.)
I've already taken a look at this thread: Emulating shifts on 32 bytes with AVX, but I don't really understand what is going on, and it doesn't explain what the _MM_SHUFFLE(0, 0, 3, 0) does as an input parameter.
I'm trying to optimise this code:
_mm256_store_ps(temp, array[POS(ii, jj)]);
_mm256_store_ps(left, array[POS(ii, jj-1)]);
tmp_array[POS(ii, jj)] = _mm256_set_ps(left[0], temp[7], temp[6], temp[5], temp[4], temp[3], temp[2], temp[1]);
I know once a shift is in place, I can use an insert to replace the remaining cell. I feel this will be more efficient than unpacking into a float[8] array and reconstructing.
-- I'd also like to be able to shift both left and right, as I need to perform a similar operation elsewhere.
Any help is greatly appreciated! Thanks!
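On the _MM_SHUFFLE part of the question: the macro only packs four 2-bit fields into one 8-bit immediate, and what that immediate means depends on the instruction it is passed to. A minimal sketch, using a hypothetical PACK_SHUFFLE macro that mirrors the usual definition of _MM_SHUFFLE so it builds without any SIMD headers (shuffle_field is likewise just an illustrative helper):

```c
#include <assert.h>

/* Hypothetical stand-in for _MM_SHUFFLE(z, y, x, w):
   four 2-bit fields, z in bits 7:6 down to w in bits 1:0. */
#define PACK_SHUFFLE(z, y, x, w) (((z) << 6) | ((y) << 4) | ((x) << 2) | (w))

/* Extract 2-bit field number `field` (0 = lowest two bits) from an immediate. */
unsigned shuffle_field(unsigned imm, unsigned field)
{
    assert(field < 4);
    return (imm >> (2 * field)) & 3u;
}
```

So _MM_SHUFFLE(0, 0, 3, 0) is just the constant 0x0C: field 1 holds 3 and the other three fields hold 0.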
For AVX2:
Use VPERMPS to do it in one lane-crossing shuffle instruction.
rotated_right = _mm256_permutevar8x32_ps(src, _mm256_set_epi32(0,7,6,5,4,3,2,1));
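The same instruction handles the other direction; only the index vector changes. Below is a sketch of both rotations, assuming hypothetical wrapper names rotr_f32x8 and rotl_f32x8 that work on plain float[8] buffers, with a scalar fallback so it also builds without AVX2. Keep in mind that _mm256_set_epi32 takes its arguments from element 7 down to element 0, and that element i of the result is taken from the source element named by index i.

```c
#include <assert.h>
#ifdef __AVX2__
#include <immintrin.h>
#endif

/* Rotate right by one element: dst[i] = src[(i + 1) % 8], so src[0] wraps to dst[7]. */
void rotr_f32x8(const float *src, float *dst)
{
    assert(src != dst);  /* the scalar fallback cannot run in place */
#ifdef __AVX2__
    __m256 v = _mm256_loadu_ps(src);
    /* Indices listed for elements 7..0: element 7 reads v[0], element 0 reads v[1]. */
    __m256i idx = _mm256_set_epi32(0, 7, 6, 5, 4, 3, 2, 1);
    _mm256_storeu_ps(dst, _mm256_permutevar8x32_ps(v, idx));
#else
    for (int i = 0; i < 8; ++i)
        dst[i] = src[(i + 1) % 8];  /* scalar model of the same permutation */
#endif
}

/* Rotate left by one element: dst[i] = src[(i + 7) % 8], so src[7] wraps to dst[0]. */
void rotl_f32x8(const float *src, float *dst)
{
    assert(src != dst);  /* the scalar fallback cannot run in place */
#ifdef __AVX2__
    __m256 v = _mm256_loadu_ps(src);
    /* Indices listed for elements 7..0: element 7 reads v[6], element 0 reads v[7]. */
    __m256i idx = _mm256_set_epi32(6, 5, 4, 3, 2, 1, 0, 7);
    _mm256_storeu_ps(dst, _mm256_permutevar8x32_ps(v, idx));
#else
    for (int i = 0; i < 8; ++i)
        dst[i] = src[(i + 7) % 8];  /* scalar model of the same permutation */
#endif
}
```

If you want the wrapped-around slot overwritten anyway, a blend against the replacement value after the permute should do it.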
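For AVX (without AVX2):
Since you say the data is coming from memory already, this might be good: do an unaligned load from one element further along to get seven of the elements into the right place (which sidesteps the lane-crossing problem), then blend in the element that wraps around.
To get the wrap-around element into position for the blend, a broadcast load works. AVX can broadcast-load in a single instruction (so set1() is cheap), although it does need the shuffle port on Intel SnB and IvB (the only two Intel microarchitectures with AVX but not AVX2). (See perf links in the x86 tag wiki.) INSERTPS only works on XMM destinations, and can't reach the upper lane.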
You could maybe use VINSERTF128 to do another unaligned load that ends up putting the element you want as the high element in the upper lane (with some don't-care vector in the low lane).
This compiles, but isn't tested.
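#include <immintrin.h>

// Loads 8 floats from src and returns them rotated right by one element:
// result[i] = src[i + 1] for i = 0..6, and result[7] = src[0].
__m256 load_rotr(const float *src)
{
#ifdef __AVX2__
    __m256 orig = _mm256_loadu_ps(src);
    // One lane-crossing VPERMPS; indices are listed for elements 7..0.
    __m256 rotated_right = _mm256_permutevar8x32_ps(orig, _mm256_set_epi32(0,7,6,5,4,3,2,1));
    return rotated_right;
#else
    // Unaligned load one element further along. Note that this reads src[8],
    // which is then discarded by the blend, so it must be readable memory.
    __m256 shifted = _mm256_loadu_ps(src + 1);
    __m256 bcast = _mm256_set1_ps(*src);            // src[0] in every element
    return _mm256_blend_ps(shifted, bcast, 0x80);   // bit 7 set: element 7 comes from bcast
#endif
}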
See the code + asm on Godbolt
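For the other direction without AVX2, the same load-plus-blend idea should work mirrored. This is an untested sketch under stated assumptions: load_rotl_store is a hypothetical name, it stores to a float[8] instead of returning a vector, and it assumes the float just before src is readable, because the unaligned load starts at src - 1. A scalar fallback is included so it builds without AVX.

```c
#include <assert.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

/* Hypothetical counterpart to load_rotr: writes src[7], src[0], ..., src[6]
   to dst[0..7].
   ASSUMPTION: src[-1] is readable memory inside the same array; the AVX path
   loads it and then discards it in the blend. */
void load_rotl_store(const float *src, float *dst)
{
    assert(src != dst);  /* the scalar fallback cannot run in place */
#ifdef __AVX__
    __m256 shifted = _mm256_loadu_ps(src - 1);  /* element i = src[i - 1] */
    __m256 bcast   = _mm256_set1_ps(src[7]);    /* wrap-around value in every element */
    /* Mask bit 0 set: element 0 comes from bcast, elements 1..7 from shifted. */
    _mm256_storeu_ps(dst, _mm256_blend_ps(shifted, bcast, 0x01));
#else
    dst[0] = src[7];
    for (int i = 1; i < 8; ++i)
        dst[i] = src[i - 1];
#endif
}
```

If the element you actually want in slot 0 comes from the neighbouring cell and not from src[7], broadcast that value in place of src[7].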