Shifting SSE/AVX registers 32 bits left and right while shifting in zeros







I want to shift SSE/AVX registers left or right by multiples of 32 bits while shifting in zeros.

Let me be more precise about the shifts I'm interested in. For SSE I want to do the following shifts of four 32-bit floats:

shift1_SSE: [1, 2, 3, 4] -> [0, 1, 2, 3]
shift2_SSE: [1, 2, 3, 4] -> [0, 0, 1, 2]

For AVX I want to do the following shifts:

shift1_AVX: [1, 2, 3, 4, 5, 6, 7, 8] -> [0, 1, 2, 3, 4, 5, 6, 7]
shift2_AVX: [1, 2, 3, 4, 5, 6, 7, 8] -> [0, 0, 1, 2, 3, 4, 5, 6]
shift3_AVX: [1, 2, 3, 4, 5, 6, 7, 8] -> [0, 0, 0, 0, 1, 2, 3, 4]

For SSE I have come up with the following code:

shift1_SSE = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)); 
shift2_SSE = _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40);
//shift2_SSE = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8));

Is there a better way to do this with SSE?
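For checking the two SSE variants above against the intended results, a minimal sketch follows. It assumes an x86 compiler with SSE2 enabled (the default on x86-64). The names shift1_sse, shift2_sse, equals4 and check_sse_shifts are made up for this illustration.

```c
#include <assert.h>
#include <emmintrin.h>  // SSE2: _mm_slli_si128 and the cast intrinsics

// [x0, x1, x2, x3] -> [0, x0, x1, x2]: byte-shift the whole register by one float.
static inline __m128 shift1_sse(__m128 x) {
    return _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4));
}

// [x0, x1, x2, x3] -> [0, 0, x0, x1]: low half from the zero vector, high half from x.
static inline __m128 shift2_sse(__m128 x) {
    return _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40);
}

// Hypothetical helper: returns 1 if v holds exactly the four expected values.
static inline int equals4(__m128 v, float e0, float e1, float e2, float e3) {
    float out[4];
    _mm_storeu_ps(out, v);
    return out[0] == e0 && out[1] == e1 && out[2] == e2 && out[3] == e3;
}

// Hypothetical self-check using the inputs from the tables above.
static inline void check_sse_shifts(void) {
    __m128 x = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);  // element 0 first
    assert(equals4(shift1_sse(x), 0.0f, 1.0f, 2.0f, 3.0f));
    assert(equals4(shift2_sse(x), 0.0f, 0.0f, 1.0f, 2.0f));
}
```

Calling check_sse_shifts() from main is enough to exercise both variants. _mm_setr_ps is used so that the argument order matches the lowest-element-first notation of the tables.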

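For AVX I came up with the following code, which needs AVX2 and is untested. Edit: as Paul R explained, this code won't work (_mm256_slli_si256 shifts each 128-bit lane separately, so nothing crosses from the low lane into the high lane).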

shift1_AVX2 = _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(x), 4));
shift2_AVX2 = _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(x), 8));
shift3_AVX2 = _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(x), 12));

How can I best do this with AVX, not AVX2 (for example with _mm256_permute or _mm256_shuffle)? Is there a better way to do this with AVX2?


Paul R has informed me that my AVX2 code won't work and that AVX code is probably not worth it. Instead for AVX2 I should use _mm256_permutevar8x32_ps along with _mm256_and_ps. I don't have a system with AVX2 (Haswell) so this is hard to test.
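The AVX2 suggestion could look roughly like the sketch below: a lane-crossing permute with _mm256_permutevar8x32_ps, followed by _mm256_and_ps to clear the elements that should be zero. I have not run it on Haswell hardware. The index and mask constants are my own reading of the suggestion. The TARGET_AVX2 macro (a GCC/Clang attribute, so that the file builds without -mavx2), the array wrapper and the check function are assumptions made for this illustration.

```c
#include <assert.h>
#include <immintrin.h>

// GCC/Clang extension (assumption): enable AVX2 for the marked functions only.
#define TARGET_AVX2 __attribute__((target("avx2")))

// [x0..x7] -> [0, x0, ..., x6]: permute across lanes, then clear element 0.
TARGET_AVX2 static __m256 shift1_avx2(__m256 x) {
    const __m256i idx  = _mm256_setr_epi32(0, 0, 1, 2, 3, 4, 5, 6);
    const __m256  keep = _mm256_castsi256_ps(
        _mm256_setr_epi32(0, -1, -1, -1, -1, -1, -1, -1));
    return _mm256_and_ps(_mm256_permutevar8x32_ps(x, idx), keep);
}

// [x0..x7] -> [0, 0, x0, ..., x5]: same idea, clearing elements 0 and 1.
TARGET_AVX2 static __m256 shift2_avx2(__m256 x) {
    const __m256i idx  = _mm256_setr_epi32(0, 0, 0, 1, 2, 3, 4, 5);
    const __m256  keep = _mm256_castsi256_ps(
        _mm256_setr_epi32(0, 0, -1, -1, -1, -1, -1, -1));
    return _mm256_and_ps(_mm256_permutevar8x32_ps(x, idx), keep);
}

// Hypothetical array wrapper so that callers built without AVX2 never touch
// __m256. n is the shift in floats (1 or 2); returns 0 on success, -1 otherwise.
TARGET_AVX2 static int shift_avx2_f32(const float in[8], float out[8], int n) {
    __m256 x = _mm256_loadu_ps(in);
    if (n == 1)      x = shift1_avx2(x);
    else if (n == 2) x = shift2_avx2(x);
    else             return -1;
    _mm256_storeu_ps(out, x);
    return 0;
}

// Hypothetical self-check; does nothing on CPUs without AVX2.
static void check_avx2_shifts(void) {
    const float in[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    const float want[2][8] = {
        {0, 1, 2, 3, 4, 5, 6, 7},
        {0, 0, 1, 2, 3, 4, 5, 6},
    };
    float out[8];
    if (!__builtin_cpu_supports("avx2")) return;
    for (int n = 1; n <= 2; ++n) {
        assert(shift_avx2_f32(in, out, n) == 0);
        for (int i = 0; i < 8; ++i) assert(out[i] == want[n - 1][i]);
    }
}
```

The four-float shift (shift3_AVX) does not need a variable permute at all, since it only moves a whole 128-bit lane.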

Edit: Based on Felix Wyss's answer I came up with some solutions for AVX which need only three intrinsics for shift1_AVX and shift2_AVX and only one intrinsic for shift3_AVX. This works because _mm256_permute2f128_ps has a zeroing feature.


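// shift1_AVX: [x0 ... x7] -> [0, x0, ..., x6] (lowest element first)
__m256 t0 = _mm256_permute_ps(x, _MM_SHUFFLE(2, 1, 0, 3));  // [x3 x0 x1 x2 | x7 x4 x5 x6]
__m256 t1 = _mm256_permute2f128_ps(t0, t0, 41);             // [ 0  0  0  0 | x3 x0 x1 x2]
__m256 y = _mm256_blend_ps(t0, t1, 0x11);                   // [ 0 x0 x1 x2 | x3 x4 x5 x6]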


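// shift2_AVX: [x0 ... x7] -> [0, 0, x0, ..., x5] (lowest element first)
__m256 t0 = _mm256_permute_ps(x, _MM_SHUFFLE(1, 0, 3, 2));  // [x2 x3 x0 x1 | x6 x7 x4 x5]
__m256 t1 = _mm256_permute2f128_ps(t0, t0, 41);             // [ 0  0  0  0 | x2 x3 x0 x1]
__m256 y = _mm256_blend_ps(t0, t1, 0x33);                   // [ 0  0 x0 x1 | x2 x3 x4 x5]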


// shift3_AVX: [x0 ... x7] -> [0, 0, 0, 0, x0, x1, x2, x3] (lowest element first)
x = _mm256_permute2f128_ps(x, x, 41);
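
The three AVX shifts above can be wrapped into functions and checked against the tables from the start of the question. This is only a sketch. The TARGET_AVX macro (a GCC/Clang attribute, so that the file builds without -mavx), the array wrapper shift_avx_f32 and the check function are assumptions made for this illustration.

```c
#include <assert.h>
#include <immintrin.h>

// GCC/Clang extension (assumption): enable AVX for the marked functions only.
#define TARGET_AVX __attribute__((target("avx")))

// [x0..x7] -> [0, x0, ..., x6]
TARGET_AVX static __m256 shift1_avx(__m256 x) {
    __m256 t0 = _mm256_permute_ps(x, _MM_SHUFFLE(2, 1, 0, 3)); // rotate within each lane
    __m256 t1 = _mm256_permute2f128_ps(t0, t0, 41);            // [zero | low lane of t0]
    return _mm256_blend_ps(t0, t1, 0x11);                      // take elements 0 and 4 from t1
}

// [x0..x7] -> [0, 0, x0, ..., x5]
TARGET_AVX static __m256 shift2_avx(__m256 x) {
    __m256 t0 = _mm256_permute_ps(x, _MM_SHUFFLE(1, 0, 3, 2)); // swap halves of each lane
    __m256 t1 = _mm256_permute2f128_ps(t0, t0, 41);            // [zero | low lane of t0]
    return _mm256_blend_ps(t0, t1, 0x33);                      // take elements 0, 1, 4, 5 from t1
}

// [x0..x7] -> [0, 0, 0, 0, x0, x1, x2, x3]
TARGET_AVX static __m256 shift3_avx(__m256 x) {
    return _mm256_permute2f128_ps(x, x, 41);
}

// Hypothetical array wrapper so that callers built without AVX never touch
// __m256. which = 1, 2 or 3 selects the shift; returns 0 on success, -1 otherwise.
TARGET_AVX static int shift_avx_f32(const float in[8], float out[8], int which) {
    __m256 x = _mm256_loadu_ps(in);
    switch (which) {
    case 1:  x = shift1_avx(x); break;
    case 2:  x = shift2_avx(x); break;
    case 3:  x = shift3_avx(x); break;
    default: return -1;
    }
    _mm256_storeu_ps(out, x);
    return 0;
}

// Hypothetical self-check; does nothing on CPUs without AVX.
static void check_avx_shifts(void) {
    const float in[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    const float want[3][8] = {
        {0, 1, 2, 3, 4, 5, 6, 7},
        {0, 0, 1, 2, 3, 4, 5, 6},
        {0, 0, 0, 0, 1, 2, 3, 4},
    };
    float out[8];
    if (!__builtin_cpu_supports("avx")) return;
    for (int s = 1; s <= 3; ++s) {
        assert(shift_avx_f32(in, out, s) == 0);
        for (int i = 0; i < 8; ++i) assert(out[i] == want[s - 1][i]);
    }
}
```

The control byte 41 is 0x29: its low nibble has bit 3 set, which zeroes the low lane of the result, and its high nibble selects the low lane of the second operand for the high lane of the result.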

You can do a shift right with _mm256_permute_ps, _mm256_permute2f128_ps, and _mm256_blend_ps as follows (the comments list the elements from highest to lowest):

__m256 t0 = _mm256_permute_ps(x, 0x39);            // [x4  x7  x6  x5  x0  x3  x2  x1]
__m256 t1 = _mm256_permute2f128_ps(t0, t0, 0x81);  // [ 0   0   0   0  x4  x7  x6  x5] 
__m256 y  = _mm256_blend_ps(t0, t1, 0x88);         // [ 0  x7  x6  x5  x4  x3  x2  x1]

The result is in y. To do a rotate right instead, set the control byte of _mm256_permute2f128_ps to 0x01 instead of 0x81. Shift/rotate left and larger shifts/rotates can be done similarly by changing the permute and blend control bytes.
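
As an illustration of how the control bytes change, here is a sketch of the shift right by one, the rotate right by one, and a shift right by two. The first two use the control bytes given above. The shift by two (0x4E and 0xCC) is my own derivation along the same lines and should be treated as an assumption. The TARGET_AVX macro (a GCC/Clang attribute, so that the file builds without -mavx), the wrapper move_right_f32 and the check function are likewise made up for this example.

```c
#include <assert.h>
#include <immintrin.h>

// GCC/Clang extension (assumption): enable AVX for the marked functions only.
#define TARGET_AVX __attribute__((target("avx")))

// Elements listed lowest first: [x0..x7] -> [x1, ..., x7, 0]
TARGET_AVX static __m256 shift_right1_avx(__m256 x) {
    __m256 t0 = _mm256_permute_ps(x, 0x39);            // rotate each lane down by one
    __m256 t1 = _mm256_permute2f128_ps(t0, t0, 0x81);  // [high lane of t0 | zero]
    return _mm256_blend_ps(t0, t1, 0x88);              // take elements 3 and 7 from t1
}

// [x0..x7] -> [x1, ..., x7, x0]: same as above, but swap lanes instead of zeroing.
TARGET_AVX static __m256 rotate_right1_avx(__m256 x) {
    __m256 t0 = _mm256_permute_ps(x, 0x39);
    __m256 t1 = _mm256_permute2f128_ps(t0, t0, 0x01);  // [high lane of t0 | low lane of t0]
    return _mm256_blend_ps(t0, t1, 0x88);
}

// [x0..x7] -> [x2, ..., x7, 0, 0]: control bytes derived for this sketch.
TARGET_AVX static __m256 shift_right2_avx(__m256 x) {
    __m256 t0 = _mm256_permute_ps(x, 0x4E);            // swap halves of each lane
    __m256 t1 = _mm256_permute2f128_ps(t0, t0, 0x81);  // [high lane of t0 | zero]
    return _mm256_blend_ps(t0, t1, 0xCC);              // take elements 2, 3, 6, 7 from t1
}

// Hypothetical array wrapper: op 0 = shift by 1, 1 = rotate by 1, 2 = shift by 2.
// Returns 0 on success, -1 for an unknown op.
TARGET_AVX static int move_right_f32(const float in[8], float out[8], int op) {
    __m256 x = _mm256_loadu_ps(in);
    switch (op) {
    case 0:  x = shift_right1_avx(x);  break;
    case 1:  x = rotate_right1_avx(x); break;
    case 2:  x = shift_right2_avx(x);  break;
    default: return -1;
    }
    _mm256_storeu_ps(out, x);
    return 0;
}

// Hypothetical self-check; does nothing on CPUs without AVX.
static void check_right_moves(void) {
    const float in[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    const float want[3][8] = {
        {2, 3, 4, 5, 6, 7, 8, 0},
        {2, 3, 4, 5, 6, 7, 8, 1},
        {3, 4, 5, 6, 7, 8, 0, 0},
    };
    float out[8];
    if (!__builtin_cpu_supports("avx")) return;
    for (int op = 0; op < 3; ++op) {
        assert(move_right_f32(in, out, op) == 0);
        for (int i = 0; i < 8; ++i) assert(out[i] == want[op][i]);
    }
}
```

In each case the in-lane permute does most of the work, and the blend mask marks exactly those positions whose values have to come from the other lane (or from zero).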

