What is the best way to compare two integer registers pairwise and extract the equal elements using SSE instructions? For example, if a = [6 4 7 2] and b = [2 4 9 2] (each register contains four 32-bit integers), the result should be [4 2 x x]. An alternative form of this question is how to obtain a binary mask of the equal elements (..0101b) that can be used for shuffling, or as an index to look up a parameter for a shuffle instruction in a precomputed table.
It is not possible to extract and move the equal elements with a single SSE instruction. But a mask of the equal elements is easy to obtain with pcmpeqd (_mm_cmpeq_epi32):
__m128i zero = _mm_set1_epi32(0);
__m128i a = _mm_set_epi32(6, 4, 7, 2);
__m128i b = _mm_set_epi32(2, 4, 9, 2);
__m128i mask = _mm_cmpeq_epi32(a, b); // mask is now 0, -1, 0, -1 (same high-to-low order as _mm_set_epi32)
mask = _mm_sub_epi32(zero, mask); // mask is now 0, 1, 0, 1
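One thing that is easy to trip over here is lane order: _mm_set_epi32 takes its arguments from the highest element down to the lowest, so the register looks reversed once it is stored to memory. Here is a minimal sketch of the compare-and-negate step wrapped in a function that stores the per-lane flags to an array. The helper name equal_flags is made up for illustration; only SSE2 is assumed.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper: writes 1 to out[i] where lane i of a and b
   are equal, and 0 otherwise. */
void equal_flags(__m128i a, __m128i b, int32_t out[4])
{
    __m128i mask = _mm_cmpeq_epi32(a, b);             /* -1 in equal lanes, 0 elsewhere */
    mask = _mm_sub_epi32(_mm_setzero_si128(), mask);  /* 0 - (-1) = 1 */
    _mm_storeu_si128((__m128i *)out, mask);           /* lane 0 lands in out[0] */
}
```

With a = _mm_set_epi32(6, 4, 7, 2) and b = _mm_set_epi32(2, 4, 9, 2), lanes 0 and 2 match, so the stored array is {1, 0, 1, 0}.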
Edit: If you want an index into a lookup table of shuffle constants, you need a few additional operations (note that _mm_hadd_epi32, i.e. phaddd, requires SSSE3). For example:
static const __m128i zero = _mm_set1_epi32(0);
static const __m128i bits = _mm_set_epi32(1,2,4,8);
__m128i a = _mm_set_epi32(6, 4, 7, 2);
__m128i b = _mm_set_epi32(2, 4, 9, 2);
__m128i bitvector = _mm_and_si128(bits, _mm_cmpeq_epi32(a, b)); // keep one distinct bit per equal lane
bitvector = _mm_hadd_epi32(bitvector, bitvector);
bitvector = _mm_hadd_epi32(bitvector, bitvector);
// now an index from 0...15 is in the low 32 bits of bitvector (read it with _mm_cvtsi128_si32)
There might be better algorithms than using a lookup table for computing the shuffle, possibly calculating the shuffle directly using a De Bruijn multiplication. OTOH, if you have more than 4 ints to compare, each additional group of 4 ints only costs one additional phaddd.
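To make the "one additional phaddd" point concrete, here is a rough sketch for eight ints held in two register pairs. The function name equal_index8 and the bit layout (lane i of the first pair maps to bit i, lane i of the second pair to bit i+4) are my own choices for illustration, not something fixed by the code above. The target attribute is a GCC/Clang way of enabling SSSE3 for one function; with other compilers, or when building with -mssse3, it is not needed.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */
#include <tmmintrin.h>  /* SSSE3: _mm_hadd_epi32 */

#if defined(__GNUC__)
#define TARGET_SSSE3 __attribute__((target("ssse3")))
#else
#define TARGET_SSSE3
#endif

/* Hypothetical helper: returns an 8-bit index, one bit per compared lane. */
TARGET_SSSE3
int equal_index8(__m128i a0, __m128i b0, __m128i a1, __m128i b1)
{
    /* _mm_set_epi32 lists lanes from highest to lowest. */
    const __m128i bits_lo = _mm_set_epi32(8, 4, 2, 1);       /* lanes 0..3 -> bits 0..3 */
    const __m128i bits_hi = _mm_set_epi32(128, 64, 32, 16);  /* lanes 0..3 -> bits 4..7 */

    __m128i v0 = _mm_and_si128(bits_lo, _mm_cmpeq_epi32(a0, b0));
    __m128i v1 = _mm_and_si128(bits_hi, _mm_cmpeq_epi32(a1, b1));

    __m128i sum = _mm_hadd_epi32(v0, v1);  /* the one extra phaddd merges both groups */
    sum = _mm_hadd_epi32(sum, sum);
    sum = _mm_hadd_epi32(sum, sum);        /* every lane now holds the full index */
    return _mm_cvtsi128_si32(sum);
}
```

Compared with the four-int version this is three horizontal adds instead of two, since the first phaddd takes both partial bit vectors as its two operands.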
I would probably use a variant of what drhirsch proposes:
int index = _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpeq_epi32(a, b)));
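This gives you an equivalent 4-bit index for looking up a shuffle mask, using only two operations (pcmpeqd and movmskps). Note that bit i of the result corresponds to element i, which is the reverse of the bit order produced by the bits constant above, so the lookup table has to be laid out accordingly.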
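Putting the pieces together, the movemask index can drive a 16-entry table of pshufb control masks that packs the equal elements to the front. The following is only a sketch under some assumptions: the names shuffle_table, init_shuffle_table and extract_equal are invented here, the table is built at run time rather than precomputed, unused output lanes are zeroed (the question leaves them as "x"), and _mm_shuffle_epi8 needs SSSE3. The target attribute is GCC/Clang specific.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */

#if defined(__GNUC__)
#define TARGET_SSSE3 __attribute__((target("ssse3")))
#else
#define TARGET_SSSE3
#endif

/* One pshufb control mask per possible 4-bit equality index. */
static uint8_t shuffle_table[16][16];

/* Hypothetical helper: fills the table; call once before extract_equal. */
void init_shuffle_table(void)
{
    for (int idx = 0; idx < 16; ++idx) {
        int out = 0;  /* next free output lane */
        for (int lane = 0; lane < 4; ++lane) {
            if (idx & (1 << lane)) {
                /* copy the four bytes of this lane to output lane 'out' */
                for (int byte = 0; byte < 4; ++byte)
                    shuffle_table[idx][4 * out + byte] = (uint8_t)(4 * lane + byte);
                ++out;
            }
        }
        /* a control byte with the high bit set makes pshufb write zero */
        for (int k = 4 * out; k < 16; ++k)
            shuffle_table[idx][k] = 0x80;
    }
}

/* Hypothetical helper: packs the elements of a that equal the matching
   element of b to the front of out and returns how many there are. */
TARGET_SSSE3
int extract_equal(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    int idx = _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpeq_epi32(va, vb)));
    __m128i ctrl = _mm_loadu_si128((const __m128i *)shuffle_table[idx]);
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi8(va, ctrl));
    /* number of set bits in the 4-bit index */
    return (idx & 1) + ((idx >> 1) & 1) + ((idx >> 2) & 1) + ((idx >> 3) & 1);
}
```

For arrays a = {6, 4, 7, 2} and b = {2, 4, 9, 2} in memory order, elements 1 and 3 match, so the index is 10, out starts with 4, 2 and the function returns 2. In real code the table would more likely be a static constant array so no initialization call is needed.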