I am trying to convert the following code to SSE/AVX:
float x1, x2, x3;
float a1[N], a2[N], a3[N], b1[N], b2[N], b3[N];
for (int i = 0; i < N; i++)
{
if (x1 > a1[i] && x2 > a2[i] && x3 > a3[i] && x1 < b1[i] && x2 < b2[i] && x3 < b3[i])
{
// do something with i
}
}
Here N is a small constant, let's say 8. The if(...) statement evaluates to false most of the time.
First attempt:
__m128 x; // x1, x2, x3, 0
__m128 a[N]; // packed a1[i], a2[i], a3[i], 0
__m128 b[N]; // packed b1[i], b2[i], b3[i], 0
for (int i = 0; i < N; i++)
{
__m128 gt_mask = _mm_cmpgt_ps(x, a[i]);
__m128 lt_mask = _mm_cmplt_ps(x, b[i]);
__m128 mask = _mm_and_ps(gt_mask, lt_mask);
if (_mm_movemask_epi8(_mm_castps_si128(mask)) == 0xfff0)
{
// do something with i
}
}
This works, and is reasonably fast. The question is: is there a more efficient way of doing this? In particular, if there is a register with results from SSE or AVX comparisons on floats (which put 0xffffffff or 0x00000000 in each 32-bit slot), how can the results of all the comparisons be (for example) and-ed or or-ed together, in general? Is PMOVMSKB (or the corresponding _mm_movemask intrinsic) the standard way to do this?
Also, how can AVX 256-bit registers be used instead of SSE in the code above?
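One way to use 256-bit registers with the code above is to keep the same per-point packing but handle two candidates per iteration; a hedged sketch, assuming AVX is available and N is even (the names xx, A and B are invented for the example, with A[j]/B[j] holding the bounds of candidates 2*j and 2*j+1 in the low and high 128-bit halves):
__m256 xx = _mm256_insertf128_ps(_mm256_castps128_ps256(x), x, 1); // x duplicated into both halves
__m256 A[N/2], B[N/2];
for (int j = 0; j < N/2; j++)
{
    __m256 gt = _mm256_cmp_ps(xx, A[j], _CMP_GT_OQ);  // replaces _mm_cmpgt_ps
    __m256 lt = _mm256_cmp_ps(xx, B[j], _CMP_LT_OQ);  // replaces _mm_cmplt_ps
    int m = _mm256_movemask_ps(_mm256_and_ps(gt, lt));
    // with the x1, x2, x3, 0 layout above, the three data lanes of each half
    // show up as bits 1..3 (low half) and 5..7 (high half)
    if ((m & 0x0E) == 0x0E) { /* do something with 2*j     */ }
    if ((m & 0xE0) == 0xE0) { /* do something with 2*j + 1 */ }
}
Whether this actually helps depends on how cheaply the bounds can be repacked into 256-bit registers; the compare, AND and movemask map directly onto VCMPPS, VANDPS and VMOVMSKPS.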
EDIT:
Tested and benchmarked a version using PTEST/VPTEST (via the _mm_test* intrinsics), as suggested below.
__m128 x; // x1, x2, x3, 0
__m128 a[N]; // packed a1[i], a2[i], a3[i], 0
__m128 b[N]; // packed b1[i], b2[i], b3[i], 0
__m128i ref_mask = _mm_set_epi32(-1, -1, -1, 0); // all bits set in the x1/x2/x3 lanes, zero in the padding lane
for (int i = 0; i < N; i++)
{
__m128 gt_mask = _mm_cmpgt_ps(x, a[i]);
__m128 lt_mask = _mm_cmplt_ps(x, b[i]);
__m128 mask = _mm_and_ps(gt_mask, lt_mask);
if (_mm_testc_si128(_mm_castps_si128(mask), ref_mask))
{
// do stuff with i
}
}
This also works, and is fast. Benchmarking it (Intel i7-2630QM, Windows 7, cygwin 1.7, cygwin gcc 4.5.3 or mingw x86_64 gcc 4.5.3, N=8) shows it to be identical in speed to the code above (within less than 0.1%) in 64-bit builds. Either version of the inner loop runs in about 6.8 clocks on average on data that is all in cache and for which the comparison always returns false.
Interestingly, in 32-bit builds the _mm_test version runs about 10% slower. It turns out that the compiler spills the masks after loop unrolling and has to reload them; this is probably unnecessary and could be avoided in hand-coded assembly.
Which method to choose? It seems there is no compelling reason to prefer VPTEST over VMOVMSKPS. Actually, there is a slight reason to prefer VMOVMSKPS: it frees up an xmm register which would otherwise be taken up by the mask.
If you're working with floats, you generally want to use MOVMSKPS (and the corresponding AVX instruction VMOVMSKPS) instead of PMOVMSKB.
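For example, a sketch against the first loop above: _mm_movemask_ps produces one bit per float lane instead of one per byte, so the 0xfff0 check collapses to a 4-bit compare (this assumes the same x1, x2, x3, 0 packing as in the question):
if (_mm_movemask_ps(mask) == 0xE)   // lanes 1..3 set, padding lane 0 clear
{
    // do something with i
}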
That aside, yes, this is one standard way of doing this; you can also use PTEST (VPTEST) to directly update the condition flags based on the result of an SSE or AVX AND or ANDNOT.
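To make that concrete, an illustrative snippet (not the benchmarked code above): PTEST performs the AND and the ANDNOT itself and sets ZF and CF from the results, so each check below compiles to a single PTEST/VPTEST plus a conditional branch:
__m128i m   = _mm_castps_si128(mask);
__m128i ref = _mm_set_epi32(-1, -1, -1, 0);     // the three data lanes
if (_mm_testc_si128(m, ref)) { /* CF: (~m & ref) == 0, i.e. all three lanes passed */ }
if (_mm_testz_si128(m, ref)) { /* ZF: ( m & ref) == 0, i.e. no lane passed at all  */ }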