What is the most efficient way to compare two 4x 64-bit-integer AVX2 vectors for <=?
From the Intel Intrinsics Guide we have

_mm256_cmpgt_epi64(__m256i a, __m256i b) = a > b
_mm256_cmpeq_epi64(__m256i a, __m256i b) = a == b

for comparisons, and

_mm256_and_si256(__m256i a, __m256i b) = a & b
_mm256_andnot_si256(__m256i a, __m256i b) = ~a & b
_mm256_or_si256(__m256i a, __m256i b) = a | b
_mm256_xor_si256(__m256i a, __m256i b) = a ^ b

for logical operations.
My approach was:
// check = ( a <= b ) = ~(a > b) & 0xF..F
__m256i a = ...;
__m256i b = ...;
__m256i tmp = _mm256_cmpgt_epi64(a, b);
__m256i check = _mm256_andnot_si256(tmp, _mm256_set1_epi64x(-1));
You're right that there's no direct way to get the mask you really want, only an inverted mask: A gt B = A nle B.
There's no vector-NOT instruction, so you need a vector of all-ones as well as an extra instruction to invert a vector. (Or a vector of all-zeros and _mm256_cmpeq_epi8, but that can't run on as many execution ports as _mm256_xor_si256 with an all-ones vector.) See the x86 tag wiki for performance info, especially Agner Fog's guides.
The other bitwise boolean option, _mm256_andnot_si256 with an all-ones vector, is just as good as xor, but it isn't commutative and is slightly harder to mentally verify that you got right. xor-with-all-ones is a good idiom for flip-all-the-bits.
Instead of spending an instruction inverting the mask, in most code it's possible to just use it the opposite way.
e.g. if it's an input to a blendv, then reverse the order of the operands to the blend. Instead of _mm256_blendv_epi8(a, b, A_le_B_mask), use _mm256_blendv_epi8(b, a, A_nle_B_mask).
If you were going to _mm_and something with the mask, use _mm_andn instead.
If you were going to _mm_movemask and test for all-zero, you can instead test for all-ones. That compiles to a cmp eax, -1 instruction instead of test eax, eax, which is just as efficient. If you were going to bit-scan for the first set bit, you will have to invert it; an integer not instruction (from using ~ on the movemask result) is cheaper than inverting the vector.
You only have a problem if you were going to OR or XOR, because those instructions don't come in flavours that negate one of their inputs. (IDK if Intel just didn't want to add a PORN mnemonic, but probably PAND and PANDN get more use, especially before variable-blend instructions existed.)