
implement _mm256_permutevar8x32_ps using AVX instructions

Tags: c++, avx, simd, sse

The AVX2 intrinsic _mm256_permutevar8x32_ps can shuffle elements across the two 128-bit lanes, which is quite useful for sorting an array of length 8.

I only have AVX (Ivy Bridge) and want to do the same thing in as few cycles as possible. Note that both the data and the indices are runtime inputs, unknown at compile time.

For example, if the array is [1,2,3,4,5,6,7,8] and the indices are [3,0,1,7,6,5,2,4], the output should be [4,1,2,8,7,6,3,5].
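To pin down the semantics being asked for, here is a minimal scalar sketch of the permute: each output element i is table[index[i]], using only the low 3 bits of each index as the AVX2 instruction does. The function name permute8_reference is made up for illustration; it serves only as a reference to check a vectorized version against.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar reference (not an intrinsic): out[i] = table[index[i] & 7].
// Masking with 7 mirrors the AVX2 instruction, which reads only the low
// 3 bits of each 32-bit index.
std::array<float, 8> permute8_reference(const std::array<float, 8>& table,
                                        const std::array<std::int32_t, 8>& index) {
    std::array<float, 8> out{};
    for (std::size_t i = 0; i < 8; ++i) {
        out[i] = table[static_cast<std::size_t>(index[i] & 7)];
    }
    return out;
}
```

With the array and indices from the example above, this returns [4,1,2,8,7,6,3,5].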

Most of the convenient intrinsics (those without the "var" suffix) require a compile-time constant control mask, so they are not suitable in this case.

Thanks in advance.

Asked by lzhang3, Jun 20 '14 at 08:06


1 Answer

To permute across lanes in AVX you can permute within each 128-bit lane, use _mm256_permute2f128_ps to move whole lanes, and then blend the results. For example, assume you want to change the array {1, 2, 3, 4, 5, 6, 7, 8} to {0, 0, 1, 2, 3, 4, 5, 6}. You can do that like this:

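__m256 t0 = _mm256_permute_ps(x, _MM_SHUFFLE(1, 0, 3, 2)); // swap element pairs within each lane: {3,4,1,2, 7,8,5,6}
__m256 t1 = _mm256_permute2f128_ps(t0, t0, 41);            // 41 = 0x29: zero the low lane, high lane = low lane of t0: {0,0,0,0, 3,4,1,2}
__m256 y = _mm256_blend_ps(t0, t1, 0x33);                  // elements 0,1,4,5 from t1, the rest from t0: {0,0,1,2, 3,4,5,6}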

_mm256_permute2f128_ps also has a zeroing feature which can be quite useful (see also the Intel Intrinsics Guide Online): setting bit 3 or bit 7 of the control byte zeroes the low or high lane of the result. I used it in the code above to copy the first lane into the second lane and zero the first lane. See shifting-sse-avx-registers-32-bits-left-and-right-while-shifting-in-zeros for more details.
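To try the three-instruction sequence in isolation, here is a minimal sketch that wraps it in a function working on plain float arrays. The names AVX_TARGET and shift_right_two are made up for this illustration. The target("avx") attribute is an assumption for GCC/Clang builds that are not compiled with -mavx; when building with -mavx (or /arch:AVX on MSVC) it is not needed.

```cpp
#include <immintrin.h>

// Assumption: on GCC/Clang, enable AVX for this function only, so the
// example also builds without -mavx. Expands to nothing elsewhere.
#if defined(__GNUC__) || defined(__clang__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

// Hypothetical wrapper: out[0] = out[1] = 0, out[i] = in[i - 2] for i = 2..7.
AVX_TARGET static inline void shift_right_two(const float* in, float* out) {
    __m256 x  = _mm256_loadu_ps(in);
    // Swap the two element pairs inside each 128-bit lane.
    __m256 t0 = _mm256_permute_ps(x, _MM_SHUFFLE(1, 0, 3, 2));
    // 0x29 (= 41): low lane zeroed, high lane = low lane of t0.
    __m256 t1 = _mm256_permute2f128_ps(t0, t0, 0x29);
    // Mask 0x33: elements 0, 1, 4, 5 come from t1, the rest from t0.
    __m256 y  = _mm256_blend_ps(t0, t1, 0x33);
    _mm256_storeu_ps(out, y);
}
```

Calling it on {1, 2, 3, 4, 5, 6, 7, 8} should give {0, 0, 1, 2, 3, 4, 5, 6}, matching the example in the text.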

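Edit: the permutevar intrinsics allow runtime permutes and are therefore not limited to compile-time constants. With AVX alone, _mm256_permutevar_ps only permutes within each 128-bit lane, so the cross-lane part still has to be built from a lane swap and a blend. The code below is the lookup8 function from Agner Fog's Vector Class Library.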

static inline Vec8f lookup8(Vec8i const & index, Vec8f const & table) {
#if INSTRSET >= 8 && VECTORI256_H > 1 // AVX2
#if defined (_MSC_VER) && _MSC_VER < 1700 && ! defined(__INTEL_COMPILER)
    // bug in MS VS 11 beta: operands in wrong order. fixed in 11.0
    return _mm256_permutevar8x32_ps(_mm256_castsi256_ps(index), _mm256_castps_si256(table));
#elif defined (GCC_VERSION) && GCC_VERSION <= 40700 && !defined(__INTEL_COMPILER) && !defined(__clang__)
    // Gcc 4.7.0 has wrong parameter type and operands in wrong order. fixed in version 4.7.1
    return _mm256_permutevar8x32_ps(_mm256_castsi256_ps(index), table);
#else
    // no bug version
    return _mm256_permutevar8x32_ps(table, index);
#endif

#else // AVX
    // swap low and high part of table
    __m256  t1 = _mm256_castps128_ps256(_mm256_extractf128_ps(table, 1));
    __m256  t2 = _mm256_insertf128_ps(t1, _mm256_castps256_ps128(table), 1);
    // join index parts
    __m256i index2 = _mm256_insertf128_si256(_mm256_castsi128_si256(index.get_low()), index.get_high(), 1);
    // permute within each 128-bit part
    __m256  r0 = _mm256_permutevar_ps(table, index2);
    __m256  r1 = _mm256_permutevar_ps(t2,    index2);
    // high index bit for blend
    __m128i k1 = _mm_slli_epi32(index.get_high() ^ 4, 29);
    __m128i k0 = _mm_slli_epi32(index.get_low(),      29);
    __m256  kk = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_castsi128_ps(k0)), _mm_castsi128_ps(k1), 1);
    // blend the two permutes
    return _mm256_blendv_ps(r0, r1, kk);
#endif
}

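Here are the get_low and get_high functions, shown as they appear for a 256-bit double-precision type (note the Vec2db return type). They return the lower and upper 128-bit halves of the vector: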

Vec2db get_low() const {
    return _mm256_castpd256_pd128(ymm);
}
Vec2db get_high() const {
    return _mm256_extractf128_pd(ymm,1);
}
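
If you don't want to depend on the Vector Class Library, the AVX branch of lookup8 can be sketched with raw intrinsics. This is my own adaptation rather than library code: the names AVX_TARGET, permutevar8x32_ps_avx and permute8_avx are hypothetical, the lane swap uses _mm256_permute2f128_ps in place of the extract/insert pair, and the target("avx") attribute is an assumption for GCC/Clang builds without -mavx.

```cpp
#include <immintrin.h>

// Assumption: on GCC/Clang, enable AVX per function so this builds
// without -mavx. Expands to nothing elsewhere.
#if defined(__GNUC__) || defined(__clang__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

// Hypothetical AVX-only stand-in for _mm256_permutevar8x32_ps(table, index).
AVX_TARGET static inline __m256 permutevar8x32_ps_avx(__m256 table, __m256i index) {
    // Copy of table with its two 128-bit lanes swapped.
    __m256 swapped = _mm256_permute2f128_ps(table, table, 0x01);
    // In-lane permutes; only the low 2 bits of each index are used here.
    __m256 r0 = _mm256_permutevar_ps(table, index);
    __m256 r1 = _mm256_permutevar_ps(swapped, index);
    // AVX has no 256-bit integer shifts, so handle the index in 128-bit halves.
    __m128i lo = _mm256_castsi256_si128(index);
    __m128i hi = _mm256_extractf128_si256(index, 1);
    // Move index bit 2 (source lane) into the sign bit. It is inverted for
    // the high half, where "same lane" means bit 2 is set.
    __m128i k0 = _mm_slli_epi32(lo, 29);
    __m128i k1 = _mm_slli_epi32(_mm_xor_si128(hi, _mm_set1_epi32(4)), 29);
    __m256 kk = _mm256_insertf128_ps(
        _mm256_castps128_ps256(_mm_castsi128_ps(k0)), _mm_castsi128_ps(k1), 1);
    // Sign bit set: take the element from the swapped copy.
    return _mm256_blendv_ps(r0, r1, kk);
}

// Hypothetical convenience wrapper on plain arrays of 8 elements each.
AVX_TARGET static inline void permute8_avx(const float* table, const int* index, float* out) {
    __m256  t = _mm256_loadu_ps(table);
    __m256i i = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(index));
    _mm256_storeu_ps(out, permutevar8x32_ps_avx(t, i));
}
```

With the data from the question, table {1, 2, 3, 4, 5, 6, 7, 8} and indices {3, 0, 1, 7, 6, 5, 2, 4}, this should produce {4, 1, 2, 8, 7, 6, 3, 5}. I have not measured cycle counts for it on Ivy Bridge.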

Answered by Z boson, Oct 14 '22 at 03:10