Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Minimum of 4 SP values in __m128

Tags:

c

simd

sse

Suppose to have a __m128 variable holding 4 SP values, and you want the minimum one, is there any intrinsic function available, or anything other than the naive linear comparison among the values?

Right know my solution is the following (suppose the input __m128 variable is x):

x = _mm_min_ps(x, (__m128)_mm_srli_si128((__m128i)x, 4));
min = _mm_min_ss(x, (__m128)_mm_srli_si128((__m128i)x, 8))[0];

Which is quite horrible but it's working (btw, is there anything like _mm_srli_si128 but for the __m128 type?)

like image 401
Filippo Bistaffa Avatar asked Jul 14 '13 10:07

Filippo Bistaffa


2 Answers

There is no single instruction/intrinsic but you can do it with two shuffles and two mins:

__m128 _mm_hmin_ps(__m128 v)
{
    v = _mm_min_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 1, 0, 3)));
    v = _mm_min_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 0, 3, 2)));
    return v;
}

The output vector will contain the min of all the elements in the input vector, replicated throughout the output vector.

like image 81
Paul R Avatar answered Sep 16 '22 18:09

Paul R


Paul R's answer is great! (@Paul R - if you read this thank you!) I just wanted to try to explain how it actually works for anyone new to SSE stuff like me. Of course I might be wrong somewhere, so any corrections are welcome!

How does _mm_shuffle_ps work?

First of all, SSE registers have indexes that go in reverse to what you might expect, like this:

[6, 9, 8, 5] // values
 3  2  1  0  // indexes

This order of indexing makes vector left-shifts move data from low to high indices, just like left-shifting the bits in an integer. The most-significant element is at the left.


_mm_shuffle_ps can mix the contents of two registers:

// __m128 a : (a3, a2, a1, a0)
// __m128 b : (b3, b2, b1, b0)
__m128 two_from_a_and_two_from_b = _mm_shuffle_ps(b, a, _MM_SHUFFLE(3, 2,   1, 0));
//                                                                  ^  ^    ^  ^ 
//                                            indexes into second operand    indexes into first operand
// two_from_a_and_two_from_b : (a3, a2, b1, b0)

Here, we only want to shuffle the values of one register, not two. We can do that by passing v as both parameters, like this (you can see this in Paul R's function):

// __m128 v : (v3, v2, v1, v0)
__m128 v_rotated_left_by_1 = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 1, 0, 3));
// v_rotated_left_by_1 : (v2, v1, v0, v3) // i.e. move all elements left by 1 with wraparound

I'm going to wrap it in a macro for readability though:

#define mm_shuffle_one(v, pattern)  _mm_shuffle_ps(v, v, pattern)

(It can't be a function because the pattern argument to _mm_shuffle_ps must be constant at compile time.)

Here's a slightly modified version of the actual function – I added intermediate names for readability, as the compiler optimizes them out anyway:

inline __m128 _mm_hmin_ps(__m128 v){
    __m128  v_rotated_left_by_1 = mm_shuffle_one(v,  _MM_SHUFFLE(2, 1, 0, 3));
    __m128 v2 = _mm_min_ps(v,   v_rotated_left_by_1);

    __m128 v2_rotated_left_by_2 = mm_shuffle_one(v2, _MM_SHUFFLE(1, 0, 3, 2));
    __m128 v3 = _mm_min_ps(v2, v2_rotated_left_by_2);

    return v3;
}

Why are shuffling the elements the way we are? And how do we find the smallest of four elements with just two min operations?

I had some trouble following how you can min 4 floats with just two vectorized min operations, but I understood it when I manually followed which values are min'd together, step by step. (Though it's likely more fun to do it on your own than read it)

Say we've got v:

[7,6,9,5] v

First, we min the values of v and v_rotated_left_by_1:

[7,6,9,5] v
 3 2 1 0  // (just the indices of the elements)
[6,9,5,7] v_rotated_left_by_1
 2 1 0 3  // (the indexes refer to v, and we rotated it left by 1, so the indices are shifted)
--------- min
[6,6,5,5] v2
 3 2 1 0 // (explained
 2 1 0 3 //  below    )

Each column under an element of v2 tracks which indexes of v were min'd together to get that element. So, going column-wise left to right:

v2[3] == 6 == min(v[3], v[2])
v2[2] == 6 == min(v[2], v[1])
v2[1] == 5 == min(v[1], v[0])
v2[0] == 5 == min(v[0], v[3])

Now the second min:

[6,6,5,5] v2
 3 2 1 0
 2 1 0 3
[5,5,6,6] v2_rotated_left_by_2
 1 0 3 2
 0 3 2 1
--------- min
[5,5,5,5] v3
 3 2 1 0
 2 1 0 3
 1 0 3 2
 0 3 2 1

Voila! Each column under v3 contains (3,2,1,0) - each element of v3 has been mind with all the elements of v - so each element contains the minimum of the whole vector v.

After using the function, you can extract the minimum value with float _mm_cvtss_f32(__m128):

__m128 min_vector = _mm_hmin_ps(my_vector);
float minval = _mm_cvtss_f32(min_vector);

***

This is just a tangential thought, but what I found interesting is that this approach could be extended to sequences of arbitrary length, rotating the result of the previous step by 1, 2, 4, 8, ... 2**ceil(log2(len(v))) (i think) at each step. That's cool from a theoretical perspective - if you can compare two sequences element-wise simultaneously, you can find the minimum/maximum1 of a sequences in logarithmic time!

1 This extends to all horizontal folds/reductions, like sum. Same shuffles, different vertical operation.

However, AVX (256-bit vectors) makes 128-bit boundaries special, and harder to shuffle across. If you only want a scalar result, extract the high half so every step narrows the vector width in half. (Like in Fastest way to do horizontal float vector sum on x86, which has more efficient shuffles than 2x shufps for 128-bit vectors, avoiding some movaps instructions when compiling without AVX.)

But if you want the result broadcast to every element like @PaulR's answer, you'd want to do in-lane shuffles (i.e. rotate within the 4 elements in every lane), then swap halves, or rotate 128-bit lanes.

like image 39
uryga Avatar answered Sep 19 '22 18:09

uryga