I would be thankful if somebody could help me write a function that receives an AVX vector and checks whether it contains any element greater than zero.
I have written the following code, but it is not optimal because it stores the elements to memory and then loops over them; the vector should be checked as a whole.
int check(__m256 vector)
{
    float *temp;
    posix_memalign((void **) &temp, 32, 8 * sizeof(float));
    _mm256_store_ps(temp, vector);
    for (int k = 0; k < 8; k++)
    {
        if (temp[k] > 0) {
            free(temp);   /* without this, the early return leaks the buffer */
            return 1;
        }
    }
    free(temp);
    return 0;
}
If you're going to branch on the result, it's usually fewer uops to use the "traditional" compare / movemask / integer-test, like you would with SSE1.
__m256 vcmp = _mm256_cmp_ps(_mm256_setzero_ps(), x, _CMP_LT_OQ);
int cmp = _mm256_movemask_ps(vcmp);
if (cmp)
return 1;
This typically compiles to something like
vcmplt_oqps ymm2, ymm0, ymm1
vmovmskps eax, ymm2
test eax,eax
jnz .true_branch
Those are all single-uop instructions, and test/jnz macro-fuse on Intel and AMD CPUs that support AVX, so this is only 3 total uops (on Intel).
See Agner Fog's instruction tables + microarch guide, and other guides linked from https://stackoverflow.com/tags/x86/info.
You can also use PTEST, but it's less efficient for this case. (See _mm_testc_ps and _mm_testc_pd vs. _mm_testc_si128.) Without AVX, ptest is handy for checking if a register is all-zero without needing extra instructions to copy it (since it sets integer flags directly). But since it's 2 uops, and can't macro-fuse with a jcc branch instruction, it's actually worse than the above:
// don't use, sub-optimal
__m256 vcmp = _mm256_cmp_ps(_mm256_setzero_ps(), x, _CMP_LT_OQ);
__m256i icmp = _mm256_castps_si256(vcmp);  // free reinterpret: testz wants __m256i
if (!_mm256_testz_si256(icmp, icmp)) {
    return 1;
}
The testz intrinsic is PTEST. It sets the ZF and CF flags directly based on the AND and ANDN of its args: ZF = ((src1 & src2) == 0) and CF = ((~src1 & src2) == 0). So the testz intrinsic is true when the AND is all-zero, and !testz is true when vcmp has any non-zero bits (which it will only when vcmpps puts some there).
VPTEST with ymm regs is available with just AVX. AVX2 isn't required even though it looks like a vector-integer instruction.
This will compile to something like
vcmplt_oqps ymm2, ymm0, ymm1
vptest ymm2, ymm2
jnz .true_branch
Probably smaller code-size than the above, but this is actually 4 uops instead of 3. If you were using setnz or cmovnz, macro-fusion wouldn't be a factor, so ptest would break even. As I mentioned above, the main use-case for ptest is when you can use it without a compare instruction, and without AVX.
The alternative for checking a vector for all-zero (pcmpeqb xmm0,xmm1 against a zeroed xmm1 / pmovmskb eax,xmm0 / cmp eax,0xFFFF) has to destroy one of the input vectors without AVX, so it will require an extra movdqa instruction to copy if you still need both after the test.
PTEST floating-point bit-hacks
I thought that for this specific test, it might be possible to skip the compare instruction and use vptest directly to see if there are any float elements with their sign bit unset but some non-zero bits elsewhere.
Actually no, that idea can't work, because it doesn't respect element boundaries. It couldn't tell the difference between a vector with a positive element and a vector with a +0.0 element (sign bit clear) plus another element that was negative (other bits set).
vptest sets CF = ((~src1 & src2) == 0) and ZF = ((src1 & src2) == 0). I was thinking that src1 = set1(0x7FFFFFFF) could tell us something useful about sign bits vs. non-sign bits, which we could test with a condition that checks both CF and ZF, for example ja (CF=0 and ZF=0). There actually isn't an x86 condition that's only true with CF=1 and ZF=0, though, so that's another problem.
Also, NaN > 0 is false, but NaN has some set bits (exponent all-ones, mantissa non-zero, sign bit = don't-care, so there can be +NaN and -NaN). If that were the only problem, this would still be useful in cases where NaN handling isn't required.