Suppose to have a __m128
variable holding 4 SP values, and you want the minimum one, is there any intrinsic function available, or anything other than the naive linear comparison among the values?
Right know my solution is the following (suppose the input __m128
variable is x
):
x = _mm_min_ps(x, (__m128)_mm_srli_si128((__m128i)x, 4));
min = _mm_min_ss(x, (__m128)_mm_srli_si128((__m128i)x, 8))[0];
Which is quite horrible but it's working (btw, is there anything like _mm_srli_si128
but for the __m128
type?)
There is no single instruction/intrinsic but you can do it with two shuffles and two mins:
__m128 _mm_hmin_ps(__m128 v)
{
v = _mm_min_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 1, 0, 3)));
v = _mm_min_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 0, 3, 2)));
return v;
}
The output vector will contain the min of all the elements in the input vector, replicated throughout the output vector.
Paul R's answer is great! (@Paul R - if you read this thank you!) I just wanted to try to explain how it actually works for anyone new to SSE stuff like me. Of course I might be wrong somewhere, so any corrections are welcome!
_mm_shuffle_ps
work?First of all, SSE registers have indexes that go in reverse to what you might expect, like this:
[6, 9, 8, 5] // values
3 2 1 0 // indexes
This order of indexing makes vector left-shifts move data from low to high indices, just like left-shifting the bits in an integer. The most-significant element is at the left.
_mm_shuffle_ps
can mix the contents of two registers:
// __m128 a : (a3, a2, a1, a0)
// __m128 b : (b3, b2, b1, b0)
__m128 two_from_a_and_two_from_b = _mm_shuffle_ps(b, a, _MM_SHUFFLE(3, 2, 1, 0));
// ^ ^ ^ ^
// indexes into second operand indexes into first operand
// two_from_a_and_two_from_b : (a3, a2, b1, b0)
Here, we only want to shuffle the values of one register, not two. We can do that by passing v as both parameters, like this (you can see this in Paul R's function):
// __m128 v : (v3, v2, v1, v0)
__m128 v_rotated_left_by_1 = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 1, 0, 3));
// v_rotated_left_by_1 : (v2, v1, v0, v3) // i.e. move all elements left by 1 with wraparound
I'm going to wrap it in a macro for readability though:
#define mm_shuffle_one(v, pattern) _mm_shuffle_ps(v, v, pattern)
(It can't be a function because the pattern
argument to _mm_shuffle_ps
must be constant at compile time.)
Here's a slightly modified version of the actual function – I added intermediate names for readability, as the compiler optimizes them out anyway:
inline __m128 _mm_hmin_ps(__m128 v){
__m128 v_rotated_left_by_1 = mm_shuffle_one(v, _MM_SHUFFLE(2, 1, 0, 3));
__m128 v2 = _mm_min_ps(v, v_rotated_left_by_1);
__m128 v2_rotated_left_by_2 = mm_shuffle_one(v2, _MM_SHUFFLE(1, 0, 3, 2));
__m128 v3 = _mm_min_ps(v2, v2_rotated_left_by_2);
return v3;
}
min
operations?I had some trouble following how you can min
4 floats with just two vectorized min
operations, but I understood it when I manually followed which values are min
'd together, step by step. (Though it's likely more fun to do it on your own than read it)
Say we've got v
:
[7,6,9,5] v
First, we min
the values of v
and v_rotated_left_by_1
:
[7,6,9,5] v
3 2 1 0 // (just the indices of the elements)
[6,9,5,7] v_rotated_left_by_1
2 1 0 3 // (the indexes refer to v, and we rotated it left by 1, so the indices are shifted)
--------- min
[6,6,5,5] v2
3 2 1 0 // (explained
2 1 0 3 // below )
Each column under an element of v2
tracks which indexes of v
were min
'd together to get that element.
So, going column-wise left to right:
v2[3] == 6 == min(v[3], v[2])
v2[2] == 6 == min(v[2], v[1])
v2[1] == 5 == min(v[1], v[0])
v2[0] == 5 == min(v[0], v[3])
Now the second min
:
[6,6,5,5] v2
3 2 1 0
2 1 0 3
[5,5,6,6] v2_rotated_left_by_2
1 0 3 2
0 3 2 1
--------- min
[5,5,5,5] v3
3 2 1 0
2 1 0 3
1 0 3 2
0 3 2 1
Voila! Each column under v3
contains (3,2,1,0)
- each element of v3
has been min
d with all the elements of v
- so each element contains the minimum of the whole vector v
.
After using the function, you can extract the minimum value with float _mm_cvtss_f32(__m128)
:
__m128 min_vector = _mm_hmin_ps(my_vector);
float minval = _mm_cvtss_f32(min_vector);
This is just a tangential thought, but what I found interesting is that this approach could be extended to sequences of arbitrary length, rotating the result of the previous step by 1, 2, 4, 8, ... 2**ceil(log2(len(v)))
(i think) at each step.
That's cool from a theoretical perspective - if you can compare two sequences element-wise simultaneously, you can find the minimum/maximum1 of a sequences in logarithmic time!
1 This extends to all horizontal folds/reductions, like sum. Same shuffles, different vertical operation.
However, AVX (256-bit vectors) makes 128-bit boundaries special, and harder to shuffle across. If you only want a scalar result, extract the high half so every step narrows the vector width in half. (Like in Fastest way to do horizontal float vector sum on x86, which has more efficient shuffles than 2x shufps
for 128-bit vectors, avoiding some movaps
instructions when compiling without AVX.)
But if you want the result broadcast to every element like @PaulR's answer, you'd want to do in-lane shuffles (i.e. rotate within the 4 elements in every lane), then swap halves, or rotate 128-bit lanes.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With