The AVX2 intrinsic _mm256_permutevar8x32_ps can shuffle across the 128-bit lanes, which is quite useful for sorting an array of length 8. I only have AVX (Ivy Bridge) and want to do the same thing in as few cycles as possible. Note that both the data and the indices are runtime inputs, unknown at compile time.
For example, if the array is [1,2,3,4,5,6,7,8] and the indices are [3,0,1,7,6,5,2,4], the output should be [4,1,2,8,7,6,3,5].
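For reference, on AVX2 the whole operation is a single intrinsic; here is a minimal sketch of what I would like to match (permute8_avx2 is just a name for illustration):

#include <immintrin.h>

// AVX2 only: one cross-lane permute instruction does the whole job
static inline __m256 permute8_avx2(__m256 data, __m256i idx) {
    return _mm256_permutevar8x32_ps(data, idx);
}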
The control masks of most of the handy intrinsics (those without the "var" suffix) must be compile-time constants, so they are not suitable in this case.
Thanks in advance.
To permute across lanes in AVX you can permute within lanes, then use _mm256_permute2f128_ps to swap the 128-bit lanes, and then blend. For example, let's assume you want to change the array {1, 2, 3, 4, 5, 6, 7, 8} to {0, 0, 1, 2, 3, 4, 5, 6}. You can do that like this:
__m256 t0 = _mm256_permute_ps(x, _MM_SHUFFLE(1, 0, 3, 2)); // rotate each 128-bit lane by two elements
__m256 t1 = _mm256_permute2f128_ps(t0, t0, 41);            // move the low lane to the high lane, zero the low lane
__m256 y  = _mm256_blend_ps(t0, t1, 0x33);                 // take elements 0,1,4,5 from t1, the rest from t0
_mm256_permute2f128_ps also has a zeroing feature, which can be quite useful (see the Intel Intrinsics Guide Online). I used it in the code above (the immediate 41 = 0x29 sets the zeroing bit) to move the low lane into the high lane and zero the low lane. See shifting-sse-avx-registers-32-bits-left-and-right-while-shifting-in-zeros for more details.
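Here is a self-contained version of that example which you can compile and run (assuming an AVX-capable compiler, e.g. gcc -mavx):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256 x  = _mm256_setr_ps(1, 2, 3, 4, 5, 6, 7, 8);
    __m256 t0 = _mm256_permute_ps(x, _MM_SHUFFLE(1, 0, 3, 2)); // {3,4,1,2,7,8,5,6}
    __m256 t1 = _mm256_permute2f128_ps(t0, t0, 41);            // {0,0,0,0,3,4,1,2}
    __m256 y  = _mm256_blend_ps(t0, t1, 0x33);                 // {0,0,1,2,3,4,5,6}
    float out[8];
    _mm256_storeu_ps(out, y);
    for (int i = 0; i < 8; i++) printf("%g ", out[i]); // prints 0 0 1 2 3 4 5 6
    printf("\n");
    return 0;
}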
Edit: the permutevar intrinsics allow runtime permutes, so they are not limited to compile-time constants. The code below is the lookup8 function from Agner Fog's Vector Class Library.
static inline Vec8f lookup8(Vec8i const & index, Vec8f const & table) {
#if INSTRSET >= 8 && VECTORI256_H > 1 // AVX2
#if defined (_MSC_VER) && _MSC_VER < 1700 && ! defined(__INTEL_COMPILER)
    // bug in MS VS 11 beta: operands in wrong order. fixed in 11.0
    return _mm256_permutevar8x32_ps(_mm256_castsi256_ps(index), _mm256_castps_si256(table));
#elif defined (GCC_VERSION) && GCC_VERSION <= 40700 && !defined(__INTEL_COMPILER) && !defined(__clang__)
    // Gcc 4.7.0 has wrong parameter type and operands in wrong order. fixed in version 4.7.1
    return _mm256_permutevar8x32_ps(_mm256_castsi256_ps(index), table);
#else
    // no bug version
    return _mm256_permutevar8x32_ps(table, index);
#endif
#else // AVX
    // swap low and high part of table
    __m256 t1 = _mm256_castps128_ps256(_mm256_extractf128_ps(table, 1));
    __m256 t2 = _mm256_insertf128_ps(t1, _mm256_castps256_ps128(table), 1);
    // join index parts
    __m256i index2 = _mm256_insertf128_si256(_mm256_castsi128_si256(index.get_low()), index.get_high(), 1);
    // permute within each 128-bit part
    __m256 r0 = _mm256_permutevar_ps(table, index2);
    __m256 r1 = _mm256_permutevar_ps(t2, index2);
    // high index bit for blend: bit 2 of each index, shifted to the sign bit
    __m128i k1 = _mm_slli_epi32(index.get_high() ^ 4, 29);
    __m128i k0 = _mm_slli_epi32(index.get_low(), 29);
    __m256 kk = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_castsi128_ps(k0)), _mm_castsi128_ps(k1), 1);
    // blend the two permutes
    return _mm256_blendv_ps(r0, r1, kk);
#endif
}
Here are the get_low and get_high member functions of Vec8i that the AVX branch uses:
Vec4i get_low() const {
    return _mm256_castsi256_si128(ymm);
}
Vec4i get_high() const {
    return _mm256_extractf128_si256(ymm, 1);
}
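If you don't want to pull in the whole VCL, the same technique can be written with raw intrinsics. Below is a sketch using the question's data; permute8_avx is my own name for it, not a library function:

#include <immintrin.h>
#include <stdio.h>

// AVX-only emulation of _mm256_permutevar8x32_ps, same idea as lookup8 above
static inline __m256 permute8_avx(__m256 table, __m256i index) {
    // table with its two 128-bit halves swapped
    __m256 t1 = _mm256_castps128_ps256(_mm256_extractf128_ps(table, 1));
    __m256 swapped = _mm256_insertf128_ps(t1, _mm256_castps256_ps128(table), 1);
    // permute within each 128-bit lane using the low two bits of each index
    __m256 r0 = _mm256_permutevar_ps(table, index);
    __m256 r1 = _mm256_permutevar_ps(swapped, index);
    // bit 2 of each index says which half of the table it points into;
    // shift it to the sign bit so it can drive blendv
    __m128i lo = _mm256_castsi256_si128(index);
    __m128i hi = _mm256_extractf128_si256(index, 1);
    __m128i k0 = _mm_slli_epi32(lo, 29);
    __m128i k1 = _mm_slli_epi32(_mm_xor_si128(hi, _mm_set1_epi32(4)), 29);
    __m256 kk = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_castsi128_ps(k0)),
                                     _mm_castsi128_ps(k1), 1);
    return _mm256_blendv_ps(r0, r1, kk);
}

int main(void) {
    __m256  data = _mm256_setr_ps(1, 2, 3, 4, 5, 6, 7, 8);
    __m256i idx  = _mm256_setr_epi32(3, 0, 1, 7, 6, 5, 2, 4);
    float out[8];
    _mm256_storeu_ps(out, permute8_avx(data, idx));
    for (int i = 0; i < 8; i++) printf("%g ", out[i]); // prints 4 1 2 8 7 6 3 5
    printf("\n");
    return 0;
}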