Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Horizontal minimum and position in SSE for unsigned 32-bit integers

I am looking for a way to find the minimum and its position in SSE for unsigned 32-bit integers (similar to _mm_minpos_epu16). I know I can find the minimum through a series of _mm_min_epu32 and shuffles/shifts but that doesn't get me the position.

Does anyone have any cool ways of doing this?

like image 388
ChipK Avatar asked Jan 09 '23 23:01

ChipK


2 Answers

In general if one is using horizontal operators with SIMD that's a good indication that SIMD is not being used optimally. However, horizontal operations are fine at the end of a loop in which case I would just do

int result[4] __attribute__((aligned(16)));
_mm_store_si128((__m128i *) result, v);
for(int i=0; i<4; i++) if(result[i]<min) { min = result[i]; index = i; }

Nevertheless, here are some solutions using SSE. I don't know if they are any better than the code above.

The first solution is a variation on Paul R's answer.

vmin = _mm_min_epu32(vmin, _mm_alignr_epi8(vmin, vmin, 4));
vmin = _mm_min_epu32(vmin, _mm_alignr_epi8(vmin, vmin, 8));
__m128i vmask = _mm_cmpeq_epi32(v, vmin);
vmask = _mm_xor_si128(vmask, _mm_set1_epi32(-1));
__m128i vpos = _mm_minpos_epu16(vmask);

The second 16-bit word in vpos contains two times the position.

Here is another variation using _mm_minpos_epu16. It first finds the minimum upper 16-bits then masks out the values which are not in the minimum 16-bits (by setting them all high) and then it finds the minimum of the lower 16-bits as well as the position.

__m128i mask1 = _mm_setr_epi8(0x0,0x1,0x4,0x5, 0x8,0x9,0xc,0xd, 0x0,0x1,0x4,0x5,  0x8,0x9,0xc,0xd);
__m128i mask2 = _mm_setr_epi8(0x2,0x3,0x6,0x7, 0xa,0xb,0xe,0xf, 0x2,0x3,0x6,0x7,  0xa,0xb,0xe,0xf);
__m128i mask3 = _mm_set1_epi32(0x01000100);

The masks are constant so they can be calculated at compile time or outside of a loop.

__m128i lo = _mm_shuffle_epi8(v,mask1);            //lower 16-bits
__m128i hi = _mm_shuffle_epi8(v,mask2);            //upper 16-bits
__m128i t1 = _mm_minpos_epu16(hi);                 //upper 16-bits min
__m128i t2 = _mm_shuffle_epi8(t1, mask3);          //broadcast upper min
__m128i t3 = _mm_cmpeq_epi32(t2,hi);               //select equal
__m128i t4 = _mm_xor_si128(t3, _mm_set1_epi32(-1));//invert
__m128i t5 = _mm_or_si128(lo,t4);                   
__m128i t6 = _mm_minpos_epu16(t5);                 //lower 16-bits hi and position

the upper 16-bits of the minimum is in the first 16-bits of t1 and the lower 16-bits of the minimum is in the first 16-bits of t6. The position is in the second 16-bits of t6.

like image 126
Z boson Avatar answered Jan 27 '23 03:01

Z boson


There is probably a cleverer method, but for now here's a brute force approach:

#include <stdio.h>
#include <smmintrin.h> // SSE4.1

int main(void)
{
    __m128i v = _mm_setr_epi32(42, 1, 43, 2);

    printf("v     = %vlu\n", v);

    __m128i vmin = v;

    vmin = _mm_min_epu32(vmin, _mm_alignr_epi8(vmin, vmin, 4));
    vmin = _mm_min_epu32(vmin, _mm_alignr_epi8(vmin, vmin, 8));
                                                   // get min value in all elements of vmin

    printf("vmin  = %vlu\n", vmin);

    __m128i vmask = _mm_cmpeq_epi32(v, vmin);      // set min element(s) in mask to -1,
                                                   // all others to 0 [1]

    printf("vmask = %vld\n", vmask);

    int16_t mask = _mm_movemask_epi8(vmask);       // get mask as scalar [2]

    printf("mask  = %#x\n", mask);

    int pos = __builtin_ctz(mask) >> 2;            // convert scalar mask to index [3]

    printf("pos   = %d\n", pos);

    return 0;
}

If you can use a mask which is set at the position(s) of the minimum element(s) then you can just stop at [1], otherwise continue to [3] to get the index of the (least significant) minimum element.

Note also that __builtin_ctz is a gcc-specific intrinsic (although it's found in other gcc-compatible compilers too). If you're using MSVC then you'll need to use the equivalent Microsoft intrinsic (_BitScanForward).

like image 33
Paul R Avatar answered Jan 27 '23 03:01

Paul R