
Checking If A Vector Contains Any Element Greater Than Zero

Tags: c, vector, avx

I would be thankful if somebody could help me write a function that receives an AVX vector and checks whether it contains any element greater than zero.

I have written the following code, but it is not optimal because it stores the elements to memory and then inspects them one at a time. The vector should be checked as a whole.

#include <immintrin.h>

int check(__m256 vector)
{
  /* Stack buffer instead of posix_memalign: the heap version leaked temp
     whenever it returned 1 before reaching free(). */
  float temp[8];
  _mm256_storeu_ps(temp, vector);   /* unaligned store, no alignment needed */

  for (int k = 0; k < 8; k++)
  {
    if (temp[k] > 0) return 1;
  }

  return 0;
}
MROF asked Oct 19 '22 23:10


1 Answer

If you're going to branch on the result, it's usually fewer uops to use the "traditional" compare / movemask / integer-test, like you would with SSE1.

__m256 vcmp = _mm256_cmp_ps(_mm256_setzero_ps(), x, _CMP_LT_OQ);
int cmp = _mm256_movemask_ps(vcmp);
if (cmp)
    return 1;

This typically compiles to something like

vcmplt_oqps  ymm2, ymm0, ymm1
vmovmskps    eax, ymm2
test         eax, eax
jnz          .true_branch

Those are all single-uop instructions, and test/jnz macro-fuse on Intel and AMD CPUs that support AVX, so this is only 3 total uops (on Intel).
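
As a rough sketch, the compare / movemask snippet above could be wrapped into a complete function like this. The names any_positive, any_positive_8 and AVX_FN are made up for illustration, and the target attribute is a GCC/Clang extension assumed here so the file builds without -mavx; running it still requires a CPU with AVX.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

/* GCC/Clang-specific: enables AVX for the annotated function only.
   If the whole file is compiled with -mavx, this can be left empty. */
#define AVX_FN __attribute__((target("avx")))

/* Returns 1 if any of the 8 floats in x is > 0, else 0. */
static AVX_FN int any_positive(__m256 x)
{
    /* 0 < x per element; ordered, non-signaling, so NaN compares false */
    __m256 vcmp = _mm256_cmp_ps(_mm256_setzero_ps(), x, _CMP_LT_OQ);
    /* one result bit per element, taken from each element's top bit */
    return _mm256_movemask_ps(vcmp) != 0;
}

/* Hypothetical convenience wrapper: p points to 8 floats, any alignment. */
AVX_FN int any_positive_8(const float *p)
{
    assert(p != NULL);
    return any_positive(_mm256_loadu_ps(p));
}
```

The pointer wrapper is only there so that callers compiled without AVX enabled never have to pass a __m256 by value.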

See Agner Fog's instruction tables + microarch guide, and other guides linked from https://stackoverflow.com/tags/x86/info.


You can also use PTEST, but it's less efficient for this case. See _mm_testc_ps and _mm_testc_pd vs _mm_testc_si128

Without AVX, ptest is handy for checking if a register is all-zero without needing extra instructions to copy it (since it sets integer flags directly). But since it's 2 uops, and can't macro-fuse with a jcc branch instruction, it's actually worse than the above:

// don't use, sub-optimal
__m256 vcmp = _mm256_cmp_ps(_mm256_setzero_ps(), x, _CMP_LT_OQ);
__m256i vcmp_i = _mm256_castps_si256(vcmp);   // testz_si256 takes __m256i; the cast is free
if (!_mm256_testz_si256(vcmp_i, vcmp_i)) {
    return 1;
}

The testz intrinsic is PTEST. It sets the ZF and CF flags directly based on the results of AND and AND NOT of its args. The testz intrinsic returns ZF, which is 1 only when the AND of its args is all-zero, so !testz is true when vcmp has any non-zero bits (which it will only when vcmpps puts some there).

VPTEST with ymm regs is available with just AVX. AVX2 isn't required even though it looks like a vector-integer instruction.

This will compile to something like

vcmplt_oqps  ymm2, ymm0, ymm1
vptest       ymm2, ymm2
jnz      .true_branch

Probably smaller code-size than the above, but this is actually 4 uops instead of 3. If you were using setnz or cmovnz, macro-fusion wouldn't be a factor, so ptest would be break-even. As I mentioned above, the main use-case for ptest is when you can use it without a compare instruction, and without AVX.

The alternative for checking a vector for all-zero (pcmpeqb xmm0,xmm1 / pmovmskb eax, xmm0 / test eax,eax) has to destroy one of the input vectors without AVX, so it will require an extra movdqa instruction to copy if you still need both after the test.
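
To make the two all-zero checks concrete, here is a minimal sketch in intrinsics. The function names are hypothetical, the target attributes are a GCC/Clang extension assumed so the file builds without extra flags, and the SSE2 version compares the mask against 0xFFFF to get an exact "every byte is zero" result.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

/* SSE4.1 ptest: ZF is set when (v & v) == 0, and v is left untouched. */
__attribute__((target("sse4.1")))
int all_zero_ptest(const void *p)
{
    assert(p != NULL);
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    return _mm_testz_si128(v, v);
}

/* SSE2 alternative: compare every byte against zero, then collect
   the 16 per-byte results into an integer mask. */
__attribute__((target("sse2")))
int all_zero_sse2(const void *p)
{
    assert(p != NULL);
    __m128i v  = _mm_loadu_si128((const __m128i *)p);
    __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    return _mm_movemask_epi8(eq) == 0xFFFF;   /* all 16 bytes were zero */
}
```

With intrinsics the compiler inserts any register copy for you; the extra movdqa mentioned above only shows up in the generated non-AVX code.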


ptest floating point bit-hacks

I think for this specific test, it might be possible to skip the compare instruction and use vptest directly to see if there are any float elements with their sign bit unset, but some non-zero bits elsewhere.

Actually no, that idea can't work, because it doesn't respect element boundaries. It couldn't tell the difference between a vector with a positive element vs. a vector with a +0.0 element (sign bit clear) and another element that was negative (other bits set).

vptest sets CF = ((~src1 & src2) == 0) and ZF = ((src1 & src2) == 0), i.e. each flag is set when the corresponding result is all-zero. I was thinking that src1=set1(0x7FFFFFFF) could tell us something useful about sign bits and non-sign bits, which we could test with a condition that checks CF and ZF. For example ja: CF=0 and ZF=0. There actually isn't an x86 condition that's only true with CF=1 and ZF=0, though, so that's another problem.
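
The element-boundary problem can be checked directly. In this sketch (the helper name ptest_flags_nonsign is made up, and the target attribute is a GCC/Clang extension assumed so it builds without -mavx), the flags that vptest would produce with src1 = set1(0x7FFFFFFF) are returned as a small integer:

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

/* Returns ZF in bit 0 and CF in bit 1, as vptest would set them for
   src1 = set1(0x7FFFFFFF) and src2 = the 8 floats at p. */
__attribute__((target("avx")))
int ptest_flags_nonsign(const float *p)
{
    assert(p != NULL);
    __m256i mask = _mm256_set1_epi32(0x7FFFFFFF);
    __m256i x    = _mm256_castps_si256(_mm256_loadu_ps(p));
    int zf = _mm256_testz_si256(mask, x);   /* 1 if (mask & x) == 0: no non-sign bits set  */
    int cf = _mm256_testc_si256(mask, x);   /* 1 if (~mask & x) == 0: no sign bits set     */
    return zf | (cf << 1);
}
```

A vector holding {+0.0, -1.0, 0, ...} and one holding {1.0, -1.0, 0, ...} both give ZF=0 and CF=0, even though only the second contains a positive element, so no flag condition can separate them.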

Also NaN > 0 is false, but NaN has some set bits. (exponent all-ones, mantissa non-zero, sign-bit = don't care so there can be +NaN and -NaN). If that was the only problem, this would still be useful in cases where NaN-handling isn't required.
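
The NaN point can be illustrated with plain scalar C. This is only a sketch with hypothetical helper names; it assumes float is a 32-bit IEEE-754 binary32 value and builds a quiet NaN with a clear sign bit from the bit pattern 0x7FC00000.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float (assumes IEEE-754 binary32). */
static float float_from_bits(uint32_t bits)
{
    float f;
    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* What a pure bit test sees: sign bit clear and some other bit set. */
int bits_look_positive(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (bits >> 31) == 0 && (bits & 0x7FFFFFFFu) != 0;
}

/* What the floating-point compare says. */
int compares_positive(float f)
{
    return f > 0.0f;
}

/* Quiet NaN with the sign bit clear: exponent all-ones, mantissa non-zero. */
float make_positive_nan(void)
{
    return float_from_bits(0x7FC00000u);
}
```

For make_positive_nan() the bit test reports "positive" while the compare reports false, which is exactly the mismatch a bit-hack would have to accept.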

Peter Cordes answered Oct 23 '22 00:10