I'm thinking of implementing 8-ary heapsort for uint32_t's. To do this I need a function that selects the index of maximum element in a 8-element vector so that I can compare it with parent element and conditionally perform swap and further siftDown steps. (8 uint32_ts can be changed eg to 16 uint32_ts or 8 uint64_t or whatever x86 SIMD could support efficiently). I have some ideas on how to do that but I'm looking for something faster than non-vectorized code, especially I'm looking for something that would enable me to do fast heapsort. I have clang++ 3.3 and Core i7-4670 so probably I should be able to use even the newest x86 SIMD thingies. (BTW: that's a part of a bigger project: https://github.com/tarsa/SortingAlgorithmsBenchmark and there's for example quaternary heapsort so after implementing SIMD heapsort I could instantly compare them) To repeat - question is: what's the most efficient way to compute index of maximum element in x86 SIMD vector? PS: It's not a duplicate of linked questions - notice that I'm asking for an index of a maximum element, not just the element value.

Horizontal operations are bad news with SIMD, and particularly so with AVX, where most 256 bit instructions are actually broken into two separate 128 bit operations. Having said that, if you really must do a horizontal 32 bit max across 8 elements then I think the general approach would have to be: <ul> <li>find the max value (typically several shift/permute and max operations)</li> <li>splat the max value across all 8 elements of a second vector (can be combined with previous operation)</li> <li>do a compare between the original vector and the max vector (<code>_mm256_cmpeq_epi32</code>)</li> <li>extract scalar mask (<code>_mm256_movemask_epi8</code>)</li> <li>convert scalar mask to index</li> </ul> Here is a first pass at an AVX2 implementation I just put together - I tested it and benchmarked it on a 2.6 GHz Haswell and it runs at around 1.7 ns / vector (including loading the vector and storing the resulting index): <pre class="prettyprint"><code>uint8_t _mm256_hmax_index(const __m256i v) { __m256i vmax = v; vmax = _mm256_max_epu32(vmax, _mm256_alignr_epi8(vmax, vmax, 4)); vmax = _mm256_max_epu32(vmax, _mm256_alignr_epi8(vmax, vmax, 8)); vmax = _mm256_max_epu32(vmax, _mm256_permute2x128_si256(vmax, vmax, 0x01)); __m256i vcmp = _mm256_cmpeq_epi32(v, vmax); uint32_t mask = _mm256_movemask_epi8(vcmp); return __builtin_ctz(mask) >> 2; } </code></pre>

Find index of maximum element in x86 SIMD vector

Tags:

c++

x86

avx

intel

simd

sse

I'm thinking of implementing 8-ary heapsort for uint32_t's. To do this I need a function that selects the index of maximum element in a 8-element vector so that I can compare it with parent element and conditionally perform swap and further siftDown steps.

(8 uint32_ts can be changed eg to 16 uint32_ts or 8 uint64_t or whatever x86 SIMD could support efficiently).

I have some ideas on how to do that but I'm looking for something faster than non-vectorized code, especially I'm looking for something that would enable me to do fast heapsort.

I have clang++ 3.3 and Core i7-4670 so probably I should be able to use even the newest x86 SIMD thingies.

(BTW: that's a part of a bigger project: https://github.com/tarsa/SortingAlgorithmsBenchmark and there's for example quaternary heapsort so after implementing SIMD heapsort I could instantly compare them)

To repeat - question is: what's the most efficient way to compute index of maximum element in x86 SIMD vector?

PS: It's not a duplicate of linked questions - notice that I'm asking for an index of a maximum element, not just the element value.

564

asked May 11 '14 08:05

Wibowit

3 Answers

Horizontal operations are bad news with SIMD, and particularly so with AVX, where most 256 bit instructions are actually broken into two separate 128 bit operations. Having said that, if you really must do a horizontal 32 bit max across 8 elements then I think the general approach would have to be:

find the max value (typically several shift/permute and max operations)
splat the max value across all 8 elements of a second vector (can be combined with previous operation)
do a compare between the original vector and the max vector (_mm256_cmpeq_epi32)
extract scalar mask (_mm256_movemask_epi8)
convert scalar mask to index

Here is a first pass at an AVX2 implementation I just put together - I tested it and benchmarked it on a 2.6 GHz Haswell and it runs at around 1.7 ns / vector (including loading the vector and storing the resulting index):

Click to copy

uint8_t _mm256_hmax_index(const __m256i v)
{
    __m256i vmax = v;

    vmax = _mm256_max_epu32(vmax, _mm256_alignr_epi8(vmax, vmax, 4));
    vmax = _mm256_max_epu32(vmax, _mm256_alignr_epi8(vmax, vmax, 8));
    vmax = _mm256_max_epu32(vmax, _mm256_permute2x128_si256(vmax, vmax, 0x01));

    __m256i vcmp = _mm256_cmpeq_epi32(v, vmax);

    uint32_t mask = _mm256_movemask_epi8(vcmp);

    return __builtin_ctz(mask) >> 2;
}

answered Oct 03 '22 00:10

Paul R

The most efficient way to do a horizontal operation (dot product, sum, max-index, whatever) on an n-way SIMD vector is to do n of them at once, by transposing them and using vertical ops instead. Certain SIMD architectures have better support for horizontal ops, but in general the blockwise transposed approach will be more flexible and efficient.

answered Oct 05 '22 00:10

Sneftel

Using the Vc library I'd write:

Click to copy

size_t maximumIndex(Vc::uint_v vec) {
  const unsigned int max = vec.max();
  return (max == vec).firstOne();
}

With intrinsics it should be something along these lines (this is AVX without AVX2 - with AVX2 it becomes slightly easier):

Click to copy

size_t maximumIndex(_mm256i vec) {
  __m128i lo = _mm256_castsi256_si128(vec);
  __m128i hi = _mm256_extractf128_si256(vec, 1);
  __m128i tmp = _mm_max_epu32(lo, hi);
  tmp = _mm_max_epu32(tmp, _mm_shuffle_epi32(tmp, _MM_SHUFFLE(1, 0, 3, 2)));
  tmp = _mm_max_epu32(tmp, _mm_shufflelo_epi16(tmp, _MM_SHUFFLE(1, 0, 3, 2))); // using lo_epi16 for speed here
  const int max = _mm_cvtsi128_si32(tmp);
  tmp = _mm_packs_epi16(_mm_packs_epi32(_mm_cmpeq_epi32(_mm_set1_epi32(max), lo),
                                        _mm_cmpeq_epi32(_mm_set1_epi32(max), hi)),
                        _mm_setzero_si128());
  return _bit_scan_forward(_mm_movemask_epi8(tmp));
}

BTW, if you want some inspiration from SIMDized merge-sort look here: http://code.compeng.uni-frankfurt.de/projects/vc/repository/revisions/master/entry/src/avx_sorthelper.cpp

answered Oct 05 '22 00:10

Vir

Related questions
                            
                                Hashing an unordered sequence of small integers
                            
                                n-dimensional vector
                            
                                Using boost::assign::list_of
                            
                                template base class typedef members invisible
                            
                                Program Crash on glGenVertexArrays call
                            
                                choose which constructor in c++
                            
                                Does `int a = 0, b = a` have undefined behavior? [duplicate]
                            
                                C++ 11 Thread initialization with member functions compiling error [duplicate]
                            
                                Is there a standard abstraction for semirings or monoids in C++?
                            
                                About sizeof of a class member function pointer [duplicate]
                            
                                Is there a safe way to get the unsigned absolute value of a signed integer, without triggering overflow?
                            
                                Is std::nth_element defined for ranges containing same values?
                            
                                Change order of columns in QTableView
                            
                                Examples of Matlab to OpenCV conversions
                            
                                Create tests at run-time (google test)
                            
                                Is a C++11 mutex compatible with threads NOT created with C++11?
                            
                                Type-casting enum to integer and vice versa [closed]
                            
                                passing a const char instead of a std::string as function argument
                            
                                Is it possible to conditionally enable a non-const/const data member of a template class based on the constness of the template argument?
                            
                                std::vector emplace_back() for non copy constructible objects

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Find index of maximum element in x86 SIMD vector

Tags:

c++

x86

avx

intel

simd

sse

Wibowit

People also ask

3 Answers

Paul R

Sneftel

Vir

Recent Activity

Donate For Us