
Most efficient way to check if all __m128i components are 0 [using <= SSE4.1 intrinsics]

I am using SSE intrinsics to determine if a rectangle (defined by four int32 values) has changed:

__m128i oldRect; // contains old left, top, right, bottom packed to 128 bits
__m128i newRect; // contains new left, top, right, bottom packed to 128 bits

__m128i xor = _mm_xor_si128(oldRect, newRect);

At this point, the resulting xor value will be all zeros if the rectangle hasn't changed. What is then the most efficient way of determining that?

Currently I am doing so:

if (xor.m128i_u64[0] | xor.m128i_u64[1])
{
    // rectangle changed
}

But I assume there's a smarter way (possibly using some SSE instruction that I haven't found yet).

I am targeting SSE4.1 on x64 and I am coding C++ in Visual Studio 2013.

Edit: The question is not quite the same as "Is an __m128i variable zero?", as that specifies "on SSE-2-and-earlier processors" (although Antonio did add an answer "for completeness" that addresses 4.1 some time after this question was posted and answered).

d7samurai asked Jan 12 '15 15:01


2 Answers

You can use the PTEST instruction via the _mm_testz_si128 intrinsic (SSE4.1), like this:

#include <smmintrin.h> // SSE4.1 header

if (!_mm_testz_si128(xor, xor))
{
    // rectangle has changed
}

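Note that _mm_testz_si128 returns 1 if the bitwise AND of its two arguments is all zeros. Passing xor as both arguments therefore yields 1 exactly when xor itself is zero, which is why the result is negated above.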
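For reference, here is a minimal self-contained sketch of the same check wrapped in a function. The names make_rect, rect_changed_ptest and RECT_SSE41 are hypothetical and not from the question; the variable is called diff rather than xor because xor is an alternative operator token in standard C++. The target attribute is an assumption for GCC/Clang builds without -msse4.1; MSVC needs no such switch.

```cpp
#include <smmintrin.h> // SSE4.1: _mm_testz_si128 (also pulls in the SSE2 intrinsics)

// Assumption: on GCC/Clang, enable SSE4.1 for this one function so the file
// also builds without -msse4.1. MSVC exposes the intrinsic unconditionally.
#if defined(__GNUC__) || defined(__clang__)
#define RECT_SSE41 __attribute__((target("sse4.1")))
#else
#define RECT_SSE41
#endif

// Hypothetical helper: packs left, top, right, bottom into one 128-bit register.
static inline __m128i make_rect(int left, int top, int right, int bottom)
{
    return _mm_setr_epi32(left, top, right, bottom);
}

// Hypothetical helper: returns true if any of the four components differ.
RECT_SSE41 static inline bool rect_changed_ptest(__m128i oldRect, __m128i newRect)
{
    // diff is all zeros exactly when the two rectangles are identical.
    __m128i diff = _mm_xor_si128(oldRect, newRect);

    // _mm_testz_si128(diff, diff) is 1 when (diff & diff) == 0, i.e. diff == 0.
    return !_mm_testz_si128(diff, diff);
}
```

The CPU running this must support SSE4.1; the sketch does no runtime feature detection.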

Paul R answered Nov 14 '22 10:11


Ironically, the ptest instruction from SSE4.1 may be slower than pmovmskb from SSE2 in some cases. I suggest simply using:

__m128i cmp = _mm_cmpeq_epi32(oldRect, newRect);
if (_mm_movemask_epi8(cmp) != 0xFFFF)
{
    // registers are different
}

Note that if you really need that xor value, you'll have to compute it separately.

On Intel processors such as Ivy Bridge, Paul R's version with xor and _mm_testz_si128 translates into 4 uops, while the suggested version, which does not compute xor, translates into 3 uops (see also this thread). This may give my version better throughput.
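As a rough sketch, the SSE2-only comparison can be wrapped as below. The function name rect_changed_sse2 is hypothetical. _mm_cmpeq_epi32 sets every bit of a 32-bit lane whose components are equal, and _mm_movemask_epi8 collects the top bit of each of the 16 bytes, so the mask is 0xFFFF only when all four components match.

```cpp
#include <emmintrin.h> // SSE2: _mm_cmpeq_epi32, _mm_movemask_epi8

// Hypothetical helper: returns true if any of the four int32 components differ.
static inline bool rect_changed_sse2(__m128i oldRect, __m128i newRect)
{
    // Each 32-bit lane becomes 0xFFFFFFFF where equal, 0 where different.
    __m128i eq = _mm_cmpeq_epi32(oldRect, newRect);

    // One mask bit per byte: 16 set bits (0xFFFF) means every lane was equal.
    return _mm_movemask_epi8(eq) != 0xFFFF;
}
```

Which variant is faster depends on the microarchitecture and the surrounding code, so it is worth measuring both on the target hardware.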

stgatilov answered Nov 14 '22 10:11