Slow XOR operator

Question

EDIT: Indeed, I had a weird error in my timing code leading to these results. When I fixed my error, the smart version ended up faster as expected. My timing code looked like this:

bool x = false;
before = now();
for (int i=0; i<N; ++i) {
  x ^= smart_xor(A[i],B[i]);
}
after = now();

I used the ^= to discourage my compiler from optimizing the for-loop away, but I think the ^= somehow interacts strangely with the two xor functions. I changed my timing code to simply fill an array with the xor results and then do computation with that array outside of the timed code. That fixed things.
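
A minimal sketch of the corrected approach described above, assuming std::chrono::steady_clock as the timer and std::vector<char> for the inputs and results. The helper names ns_per_call and count_true are hypothetical and do not come from the original timing code.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

bool smart_xor(bool a, bool b) { return a ^ b; }
bool dumb_xor(bool a, bool b) { return a ? !b : b; }

// Hypothetical helper: calls f on every (A[i], B[i]) pair, stores each result
// in out, and returns the average time per call in nanoseconds.
// Assumes A and B have the same length.
template <typename F>
double ns_per_call(F f, const std::vector<char>& A, const std::vector<char>& B,
                   std::vector<char>& out) {
  const std::size_t n = A.size();
  out.resize(n);  // allocate before the timed region starts
  const auto before = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = f(A[i] != 0, B[i] != 0);
  }
  const auto after = std::chrono::steady_clock::now();
  const std::chrono::duration<double, std::nano> elapsed = after - before;
  return n ? elapsed.count() / static_cast<double>(n) : 0.0;
}

// Consumes the results outside the timed region by counting the true entries.
std::size_t count_true(const std::vector<char>& out) {
  std::size_t c = 0;
  for (char v : out) c += (v != 0);
  return c;
}
```

Calling ns_per_call(smart_xor, A, B, out) and then count_true(out) keeps the results live without putting a loop-carried ^= inside the timed loop. Whether the call through f gets inlined depends on the compiler and flags, so the measured number may still include call overhead.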

Should I delete this question?

END EDIT

I defined two C++ functions as follows:

bool smart_xor(bool a, bool b) {
  return a^b;
}

bool dumb_xor(bool a, bool b) {
  return a?!b:b;
}
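
As a sanity check before comparing speed, the two functions can be confirmed to agree on all four input pairs. This is a small sketch; the helper name xor_variants_agree is hypothetical.

```cpp
bool smart_xor(bool a, bool b) { return a ^ b; }
bool dumb_xor(bool a, bool b) { return a ? !b : b; }

// Returns true when both implementations agree with each other, and with
// the inequality test (a != b), on all four input pairs.
bool xor_variants_agree() {
  const bool values[] = {false, true};
  for (bool a : values) {
    for (bool b : values) {
      if (smart_xor(a, b) != dumb_xor(a, b)) return false;
      if (smart_xor(a, b) != (a != b)) return false;
    }
  }
  return true;
}
```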

My timing tests indicate that dumb_xor() is slightly faster (1.31ns vs 1.90ns when inlined, 1.92ns vs 2.21ns when not inlined). This puzzles me, as the ^ operator should be a single machine operation. I'm wondering if anyone has an explanation.

The assembly looks like this (when not inlined):

    .file   "xor.cpp"
    .text
    .p2align 4,,15
.globl _Z9smart_xorbb
    .type   _Z9smart_xorbb, @function
_Z9smart_xorbb:
.LFB0:
    .cfi_startproc
    .cfi_personality 0x3,__gxx_personality_v0
    movl    %esi, %eax
    xorl    %edi, %eax
    ret
    .cfi_endproc
.LFE0:
    .size   _Z9smart_xorbb, .-_Z9smart_xorbb
    .p2align 4,,15
.globl _Z8dumb_xorbb
    .type   _Z8dumb_xorbb, @function
_Z8dumb_xorbb:
.LFB1:
    .cfi_startproc
    .cfi_personality 0x3,__gxx_personality_v0
    movl    %esi, %edx
    movl    %esi, %eax
    xorl    $1, %edx
    testb   %dil, %dil
    cmovne  %edx, %eax
    ret
    .cfi_endproc
.LFE1:
    .size   _Z8dumb_xorbb, .-_Z8dumb_xorbb
    .ident  "GCC: (Ubuntu 4.4.3-4ubuntu5) 4.4.3"
    .section        .note.GNU-stack,"",@progbits

I'm using g++ 4.4.3-4ubuntu5 on an Intel Xeon X5570. I compiled with -O3.

Macmade · Accepted Answer

I don't think you benchmarked your code correctly.

We can see in the generated assembly that your smart_xor function is:

movl    %esi, %eax
xorl    %edi, %eax

while your dumb_xor function is:

movl    %esi, %edx
movl    %esi, %eax
xorl    $1, %edx
testb   %dil, %dil
cmovne  %edx, %eax

So obviously, the first one will be faster.
If not, then you have benchmarking issues.

So you may want to tune your benchmarking code... And remember you'll need to run a lot of calls to have a good and meaningful average.
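
One way to act on that advice is to repeat the whole measurement several times and look at both the minimum and the mean. This is only a sketch: the Stats struct and time_repeated helper are hypothetical names, and std::chrono::steady_clock is an assumed choice of timer.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical result type: minimum and mean time of one repetition, in ns.
struct Stats {
  double min_ns;
  double mean_ns;
};

// Runs work() `repetitions` times, timing each run separately.
template <typename Work>
Stats time_repeated(Work work, std::size_t repetitions) {
  std::vector<double> samples;
  samples.reserve(repetitions);
  for (std::size_t r = 0; r < repetitions; ++r) {
    const auto before = std::chrono::steady_clock::now();
    work();
    const auto after = std::chrono::steady_clock::now();
    samples.push_back(
        std::chrono::duration<double, std::nano>(after - before).count());
  }
  if (samples.empty()) return Stats{0.0, 0.0};
  double sum = 0.0;
  for (double s : samples) sum += s;
  const double min_ns = *std::min_element(samples.begin(), samples.end());
  return Stats{min_ns, sum / static_cast<double>(samples.size())};
}
```

Here work would be a callable that runs the xor loop over all N elements once. A large gap between the minimum and the mean usually hints at noise from other activity on the machine rather than at a property of the code under test.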

Mats Petersson · Answer

Given that your "dumb XOR" code is significantly longer (and most instructions are dependent on a previous one, so it won't run in parallel), I suspect that you have some sort of measurement error in your results.

The compiler needs two instructions for the out-of-line version of "smart XOR" because the arguments arrive in EDI and ESI while the return value must be in EAX, so one operand has to be moved into EAX before the XOR. In an inlined version, the code can use whichever registers already hold the data, and if the surrounding code allows it, the result can stay in one of those registers.

Calling a function out-of-line probably takes at least as long as executing the actual code in the function.

It would help if you showed the test harness that you use for benchmarking too...
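
To compare the inlined and out-of-line cases deliberately, GCC and Clang accept function attributes that control inlining. The sketch below assumes one of those compilers, since other compilers spell these attributes differently, and the function names are hypothetical.

```cpp
// GCC/Clang-specific: keep this version as a real out-of-line call.
__attribute__((noinline)) bool smart_xor_outlined(bool a, bool b) {
  return a ^ b;
}

// GCC/Clang-specific: ask the compiler to inline this version at call sites.
inline __attribute__((always_inline)) bool smart_xor_inlined(bool a, bool b) {
  return a ^ b;
}
```

Timing both variants with the same harness gives a rough idea of how much of the measured time is call overhead rather than the XOR itself.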

Tags: c++, optimization, xor

dshin
