What is the most efficient way to compare two 4x 64-bit-integer AVX2 vectors for <=?
From the Intel Intrinsics Guide we have

_mm256_cmpgt_epi64(__m256i a, __m256i b) = a > b
_mm256_cmpeq_epi64(__m256i a, __m256i b) = a == b

for comparisons, and

_mm256_and_si256(__m256i a, __m256i b) = a & b
_mm256_andnot_si256(__m256i a, __m256i b) = ~a & b
_mm256_or_si256(__m256i a, __m256i b) = a | b
_mm256_xor_si256(__m256i a, __m256i b) = a ^ b

for logical operations.
My approach was:
// check = ( a <= b ) = ~(a > b) & 0xF..F
__m256i a = ...;
__m256i b = ...;
__m256i tmp = _mm256_cmpgt_epi64(a, b);
__m256i check = _mm256_andnot_si256(tmp, _mm256_set1_epi64x(-1));
You're right that there's no direct way to get the mask you really want, only an inverted mask: A gt B = A nle B.
There's no vector-NOT instruction, so you need a vector of all-ones as well as an extra instruction to invert a vector. (Or a vector of all-zeros and _mm256_cmpeq_epi8, but that can't run on as many execution ports as _mm256_xor_si256 with an all-ones vector.) See the x86 tag wiki for performance info, especially Agner Fog's guides.
The other bitwise boolean option, _mm256_andnot_si256 with an all-ones vector, is just as good as xor, but it isn't commutative and is slightly harder to mentally verify that you got right. xor-with-all-ones is a good idiom for flip-all-the-bits.
Instead of spending an instruction inverting the mask, in most code it's possible to just use it the opposite way.
e.g. if it's an input to a blendv, then reverse the order of the operands to the blend. Instead of _mm256_blendv_epi8(a, b, A_le_B_mask), use _mm256_blendv_epi8(b, a, A_nle_B_mask).
If you were going to _mm_and something with the mask, use _mm_andn instead.
If you were going to _mm_movemask and test for all-zero, you can instead test for all-ones. That compiles to a cmp eax, -1 instruction instead of test eax, eax, which is just as efficient. If you were going to bit-scan for the first set bit, you will have to invert it; an integer not instruction (from using ~ on the movemask result) is cheaper than inverting the vector.
You only have a problem if you were going to OR or XOR, because those instructions don't come in flavours that negate one of their inputs. (IDK if Intel just didn't want to add a PORN mnemonic, but probably PAND and PANDN get more use, especially before variable-blend instructions existed.)