 

Benchmarking C struct comparison: XOR vs ==

Say we have a simple struct in C that has 4 fields:

typedef struct {
    int a;
    int b;
    int c;
    int d;
} value_st;

Let's take a look at two short versions of a C struct equality check.

The first one is straightforward and does the following:

int compare1(const value_st *x1, const value_st *x2) {
    return ( (x1->a == x2->a) && (x1->b == x2->b) &&
             (x1->c == x2->c) && (x1->d == x2->d) );
}

The second one uses XOR:

int compare2(const value_st *x1, const value_st *x2) {
    return ( (x1->a ^ x2->a) | (x1->b ^ x2->b) |
             (x1->c ^ x2->c) | (x1->d ^ x2->d) );
}

The first version returns nonzero if the two structs are equal, and the second version returns zero if and only if the two structs are equal.
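
In other words, callers test for equality differently with each version. A small illustrative snippet (not from the original post):

value_st x = { 1, 2, 3, 4 };
value_st y = { 1, 2, 3, 4 };

if (compare1(&x, &y)) {
    /* equal according to the == version: nonzero means equal */
}

if (compare2(&x, &y) == 0) {
    /* equal according to the XOR version: zero means equal */
}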

Compiler Output

Compiling with GCC at -O2 and examining the assembly shows what we would expect.

The first version compiles to 4 CMP instructions plus conditional jumps:

xor    %eax,%eax
mov    (%rsi),%edx
cmp    %edx,(%rdi)
je     0x9c0 <compare1+16>
repz   retq 
nopw   0x0(%rax,%rax,1)
mov    0x4(%rsi),%ecx
cmp    %ecx,0x4(%rdi)
jne    0x9b8 <compare1+8>
mov    0x8(%rsi),%ecx
cmp    %ecx,0x8(%rdi)
jne    0x9b8 <compare1+8>
mov    0xc(%rsi),%eax
cmp    %eax,0xc(%rdi)
sete   %al
movzbl %al,%eax
retq   

The second version looks like this:

mov    (%rdi),%eax
mov    0x4(%rdi),%edx
xor    (%rsi),%eax
xor    0x4(%rsi),%edx
or     %edx,%eax
mov    0x8(%rdi),%edx
xor    0x8(%rsi),%edx
or     %edx,%eax
mov    0xc(%rdi),%edx
xor    0xc(%rsi),%edx
or     %edx,%eax
retq   

So the second version has:

  • no branches
  • fewer instructions (a 0/1-returning variant that keeps these properties is sketched below)
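
For reference, a variant of the XOR version that returns the same 0/1 result as compare1 can also stay branchless; with GCC at -O2 the logical negation below typically compiles to the same XOR/OR chain followed by a SETE rather than a branch. This compare3 is my own sketch, not part of the original benchmark:

int compare3(const value_st *x1, const value_st *x2) {
    /* the OR of the XORs is zero only when all four fields match;
       '!' turns that into the 1-means-equal convention            */
    return !( (x1->a ^ x2->a) | (x1->b ^ x2->b) |
              (x1->c ^ x2->c) | (x1->d ^ x2->d) );
}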

Benchmarking

#include <stdint.h>
#include <time.h>

/* millisecond timestamp from the monotonic clock */
static uint64_t
now_msec() {
    struct timespec spec;
    clock_gettime(CLOCK_MONOTONIC, &spec);
    return ((uint64_t)spec.tv_sec * 1000) + (spec.tv_nsec / 1000000);
}

/* x1 and x2 are global value_st structs defined elsewhere;
   in this run they are equal, so every field gets compared */
void benchmark() {
    uint64_t start = now_msec();
    uint64_t sum = 0;

    for (uint64_t i = 0; i < 10000000000ULL; i++) {
        if (compare1(&x1, &x2)) {
            sum++;
        }
    }

    uint64_t delta_ms = now_msec() - start;

    // use sum and delta_ms here
}

There are enough iterations to make the cost of calling clock_gettime() negligible.
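
For completeness, a full harness along the following lines would produce output in the format shown below; main(), the x1/x2 initialization, and the compare2 loop are my reconstruction of what the original presumably did, not the author's code. Depending on the compiler, you may need to initialize x1 and x2 at run time (or otherwise hide their values from the optimizer) so the loops are not folded away:

#include <stdio.h>   /* value_st, compare1, compare2 and now_msec are defined above */

static value_st x1 = { 1, 2, 3, 4 };   /* placeholder values; equal structs so  */
static value_st x2 = { 1, 2, 3, 4 };   /* every field has to be compared        */

int main(void) {
    uint64_t start, sum;

    start = now_msec();
    sum = 0;
    for (uint64_t i = 0; i < 10000000000ULL; i++) {
        if (compare1(&x1, &x2)) sum++;
    }
    printf("time took for compare == is %llu [ms]   [matches: %llu]\n",
           (unsigned long long)(now_msec() - start), (unsigned long long)sum);

    start = now_msec();
    sum = 0;
    for (uint64_t i = 0; i < 10000000000ULL; i++) {
        if (compare2(&x1, &x2) == 0) sum++;
    }
    printf("time took for compare XOR is %llu [ms]  [matches: %llu]\n",
           (unsigned long long)(now_msec() - start), (unsigned long long)sum);

    return 0;
}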

But here is the thing I don't get...

When I benchmark with equal structs, so that all the instructions have to be executed, the first version is faster...

time took for compare == is 3114 [ms]   [matches: 10000000000]
time took for compare XOR is 3177 [ms]  [matches: 10000000000]

How is this possible? Even with branch prediction, XOR is a very fast instruction and should not lose to CMP/JMP.

Update

A couple of important notes:

  • This question is mainly about understanding the outcome, not about trying to beat the compiler or write obscure code - it is always better to write clean code and let the compiler optimize
  • We assume the structs are in the cache, otherwise the dominating factor will obviously be the memory lookup
  • Branch prediction will obviously play a part... but can it be better than branchless code (given that most of the time we execute all the code)?
  • memcmp requires the struct's padding bytes (if any) to match - e.g. to be zeroed - and in most standard implementations it also needs a loop / branch, since it supports variable-size comparison (a sketch follows this list)
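
For comparison, a memcmp-based check might look like the sketch below (my own illustration, not from the post). It compares the raw object representation, so it is only equivalent to the field-by-field versions when the padding bytes (if any) are guaranteed to match; here value_st is four ints and normally has no padding:

#include <string.h>

int compare_memcmp(const value_st *x1, const value_st *x2) {
    /* compares raw bytes, padding included; returns 1 when equal */
    return memcmp(x1, x2, sizeof(value_st)) == 0;
}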

Update 2

  • Many have stated that the difference per call is tiny... this is true, but it is consistent: across many consecutive runs the difference is always in favor of the first version

Update 3

I've copied my test code to a lab server with an Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz. The XOR version runs almost two times faster on the server with GCC 8.

I tried with both Clang and GCC 8:

For GCC 8:

time took for compare == is 7432 [ms]    [matches: 3000000000]
time took for compare XOR is 4214 [ms]   [matches: 3000000000]

For Clang:

time took for compare == is 4265 [ms]    [matches: 3000000000]
time took for compare XOR is 5508 [ms]   [matches: 3000000000]

So it seems like this is very compiler and CPU dependent.

Asked Oct 29 '25 by Itay Marom

2 Answers

Well, in the first case there are 4 MOVs and 4 CMPs. In the second case there are 4 MOVs, 4 XORs and 3 ORs. Since jumps that are not taken take effectively no time, the first version is faster. (CMP and XOR do basically the same work and should execute in the same amount of time.)

The moral of the story here is that you should never try to outsmart your compiler; it really does know better (at least in 99.99% of cases).

And never obscure the intent of your program in an effort to make it faster, unless you have hard evidence it is (1) needed and (2) effective.

Answered Nov 01 '25 by koder


time took for compare == is 3114 [ms]   [matches: 10000000000]
time took for compare XOR is 3177 [ms]  [matches: 10000000000]

How is this possible?

Because actual execution time is affected by many factors out of your control, which is why you should never rely on a single run of a benchmarking program to make any decisions. Run it many times, under different load conditions, and average the results.

Secondly, this run shows a difference of 63 milliseconds out of a little over 3 seconds, or about 2%, for ten billion comparisons between the two methods. As far as a person sitting in front of the screen is concerned, that's barely noticeable. If your results consistently showed a difference of a full second or more, that would be worth investigating, but this is down in the noise.

And finally, what is going to be the more common operation in the real code - comparing identical structs or non-identical structs? If the second case is going to be more common, even if just by a bare majority of 51%, then the == method will be significantly faster on average due to short-circuiting.
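
To make the short-circuiting point concrete, here is a small illustration (mine, not from the benchmark): when the first fields already differ, compare1 evaluates only the first comparison, while compare2 always reads and XORs all four fields.

value_st p = { 0, 2, 3, 4 };
value_st q = { 1, 2, 3, 4 };

/* compare1: (p.a == q.a) is false, so && short-circuits and the
   remaining three comparisons are never evaluated               */
int eq1 = compare1(&p, &q);   /* 0 */

/* compare2: all four XORs and the ORs are always computed */
int eq2 = compare2(&p, &q);   /* nonzero */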

When optimizing code, look at the big picture - don't hyperfocus on a single operation. You'll wind up writing code that's hard to read, harder to maintain, and probably not as optimized as you think it is.

Answered Nov 01 '25 by John Bode


