Is it possible to check if any of 2 sets of 3 ints is equal with less than 9 comparisons?

Tags:

int eq3(int a, int b, int c, int d, int e, int f){
    return a == d || a == e || a == f 
        || b == d || b == e || b == f 
        || c == d || c == e || c == f;
}

This function receives 6 ints and returns true if any of the 3 first ints is equal to any of the 3 last ints. Is there any bitwise-hack similar way to make it faster?

937

asked Feb 16 '16 23:02

2 Answers

Assuming you're expecting a high rate of false results you could make a quick "pre-check" to quickly isolate such cases:

If a bit in a is set that isn't set in any of d, e and f then a cannot be equal to any of these.

Thus something like

int pre_eq3(int a, int b, int c, int d, int e, int f){
    int const mask = ~(d | e | f);
    if ((a & mask) && (b & mask) && (c & mask)) {
         return false;
    }
    return eq3(a, b, c, d, e, f);
}

could speed it up (8 operations instead of 9 17, but much more costly if the result will actually be true). If mask == 0 then of course this won't help.

This can be further improved if with high probability a & b & c has some bits set:

int pre_eq3(int a, int b, int c, int d, int e, int f){
    int const mask = ~(d | e | f);
    if ((a & b & c) & mask) {
        return false;
    }
    if ((a & mask) && (b & mask) && (c & mask)) {
         return false;
    }
    return eq3(a, b, c, d, e, f);
}

Now if all of a, b and c have bits set where none of d, e and c have any bits set we're out pretty fast.

200

answered Sep 20 '22 14:09

Daniel Jour

Expanding on dawg's SSE comparison method, you can combine the results of the comparisons using a vector OR, and move a mask of the compare results back to an integer to test for 0 / non-zero.

Also, you can get data into vectors more efficiently (although it's still pretty clunky to get many separate integers into vectors when they're live in registers to start with, rather than sitting in memory).

You should avoid store-forwarding stalls that result from doing three small stores and one big load.

///// UNTESTED ////////
#include <immintrin.h>
int eq3(int a, int b, int c, int d, int e, int f){

    // Use _mm_set to let the compiler worry about getting integers into vectors
    // Use -mtune=intel or gcc will make bad code, though :(
    __m128i abcc = _mm_set_epi32(0,c,b,a);  // args go from high to low position in the vector
    // masking off the high bits of the result-mask to avoid false positives
    // is cheaper than repeating c (to do the same compare twice)

    __m128i dddd = _mm_set1_epi32(d);
    __m128i eeee = _mm_set1_epi32(e);

    dddd = _mm_cmpeq_epi32(dddd, abcc);
    eeee = _mm_cmpeq_epi32(eeee, abcc);  // per element: 0(unequal) or -1(equal)
    __m128i combined = _mm_or_si128(dddd, eeee);

    __m128i ffff = _mm_set1_epi32(f);
    ffff = _mm_cmpeq_epi32(ffff, abcc);
    combined = _mm_or_si128(combined, ffff);

    // results of all the compares are ORed together.  All zero only if there were no hits
    unsigned equal_mask = _mm_movemask_epi8(combined);
    equal_mask &= 0x0fff;  // the high 32b element could have false positives
    return equal_mask;
    // return !!equal_mask if you want to force it to 0 or 1
    // the mask tells you whether it was a, b, or c that had a hit

    // movmskps would return a mask of just 4 bits, one for each 32b element, but might have a bypass delay on Nehalem.
    // actually, pmovmskb apparently runs in the float domain on Nehalem anyway, according to Agner Fog's table >.<
}

This compiles to pretty nice asm, pretty similar between clang and gcc, but clang's -fverbose-asm puts nice comments on the shuffles. Only 19 instructions including the ret, with a decent amount of parallelism from separate dependency chains. With -msse4.1, or -mavx, it saves another couple of instructions. (But probably doesn't run any faster)

With clang, dawg's version is about twice the size. With gcc, something bad happens and it's horrible (over 80 instructions. Looks like a gcc optimization bug, since it looks worse than just a straightforward translation of the source). Even clang's version spends so long getting data into / out of vector regs that it might be faster to just do the comparisons branchlessly and OR the truth values together.

This compiles to decent code:

// 8bit variable doesn't help gcc avoid partial-register stalls even with -mtune=core2 :/
int eq3_scalar(int a, int b, int c, int d, int e, int f){
    char retval = (a == d) | (a == e) | (a == f)
         | (b == d) | (b == e) | (b == f)
         | (c == d) | (c == e) | (c == f);
    return retval;
}

Play around with how to get the data from the caller into vector regs. If the groups of three are coming from memory, then prob. passing pointers so a vector load can get them from their original location is best. Going through integer registers on the way to vectors sucks (higher latency, more uops), but if your data is already live in regs it's a loss to do integer stores and then vector loads. gcc is dumb and follows the AMD optimization guide's recommendation to bounce through memory, even though Agner Fog says he's found that's not worth it even on AMD CPUs. It's definitely worse on Intel, and apparently a wash or maybe still worse on AMD, so it's definitely the wrong choice for -mtune=generic. Anyway...

It's also possible to do 8 of our 9 compares with just two packed-vector compares.

The 9th can be done with an integer compare, and have its truth value ORed with the vector result. On some CPUs (esp. AMD, and maybe Intel Haswell and later) not transferring one of the 6 integers to vector regs at all might be a win. Mixing three integer branchless-compares in with the vector shuffles / compares would interleave them nicely.

These vector comparisons can be set up by using shufps on integer data (since it can combine data from two source registers). That's fine on most CPUs, but requires a lot of annoying casting when using intrinsics instead of actual asm. Even if there is a bypass delay, it's not a bad tradeoff vs. something like punpckldq and then pshufd.

aabb    ccab
====    ====
dede    deff

c==f

with asm something like:

#### untested
# pretend a is in eax, and so on
movd     xmm0, eax
movd     xmm1, ebx
movd     xmm2, ecx

shl      rdx, 32
#mov     edi, edi     # zero the upper 32 of rdi if needed, or use shld instead of OR if you don't care about AMD CPUs
or       rdx, rdi     # de in an integer register.
movq     xmm3, rdx    # de, aka (d<<32)|e
# in 32bit code, use a vector shuffle of some sort to do this in a vector reg, or:
#pinsrd  xmm3, edi, 1  # SSE4.1, and 2 uops (same as movd+shuffle)
#movd    xmm4, edi    # e

movd     xmm5, esi    # f

shufps   xmm0, xmm1, 0            #  xmm0=aabb  (low dword = a; my notation is backwards from left/right vector-shift perspective)
shufps   xmm5, xmm3, 0b01000000   # xmm5 = ffde  
punpcklqdq xmm3, xmm3            # broadcast: xmm3=dede
pcmpeqd  xmm3, xmm0              # xmm3: aabb == dede

# spread these instructions out between vector instructions, if you aren't branching
xor      edx,edx
cmp      esi, ecx     # c == f
#je   .found_match    # if there's one of the 9 that's true more often, make it this one.  Branch mispredicts suck, though
sete     dl

shufps   xmm0, xmm2, 0b00001000  # xmm0 = abcc
pcmpeqd  xmm0, xmm5              # abcc == ffde

por      xmm0, xmm3
pmovmskb eax, xmm0    # will have bits set if cmpeq found any equal elements
or       eax, edx     # combine vector and scalar compares
jnz  .found_match
# or record the result instead of branching on it
setnz    dl

This is also 19 instructions (not counting the final jcc / setcc), but one of them is an xor-zeroing idiom, and there are other simple integer instructions. (Shorter encoding, some can run on port6 on Haswell+ which can't handle vector instructions). There might be a longer dep chain due to the chain of shuffles that builds abcc.

answered Sep 19 '22 14:09

Peter Cordes

Related questions
                            
                                Definition of main() in C [duplicate]
                            
                                break in do while loop
                            
                                __vdso_time missing from core utils?
                            
                                C String Literal "too big for character"
                            
                                Why do we need to add a '\0' (null) at the end of a character array in C?
                            
                                Maximum size of a bit field in C or C++? [duplicate]
                            
                                Debug fork() in eclipse cdt
                            
                                Runtime Memory allocation on stack
                            
                                note: previous implicit declaration of ‘point_forward’ was here
                            
                                __isoc99_scanf and scanf
                            
                                In what languages besides C can I write a C library?
                            
                                open() doesn't set O_CLOEXEC flag
                            
                                Why is h_addr_list in struct hostent a char ** instead of struct in_addr **?
                            
                                Nameless union inside a union
                            
                                Difference in floating point arithmetics between x86 and x64
                            
                                What is MODULE_ALIAS in Linux device driver code?
                            
                                How to struct unpack c null terminated string?
                            
                                Valgrind won't detect buffer overflow
                            
                                How to chain BCryptEncrypt and BCryptDecrypt calls using AES in GCM mode?
                            
                                Does comma separators in type definition in C guarantee the order?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Is it possible to check if any of 2 sets of 3 ints is equal with less than 9 comparisons?

Tags:

performance

c

bit-manipulation

micro-optimization

MaiaVictor

People also ask

2 Answers

Daniel Jour

It's also possible to do 8 of our 9 compares with just two packed-vector compares.

Peter Cordes

Recent Activity

Donate For Us