EDIT: Indeed, I had a weird error in my timing code leading to these results. When I fixed my error, the smart version ended up faster as expected. My timing code looked like this:
bool x = false;
before = now();
for (int i=0; i<N; ++i) {
x ^= smart_xor(A[i],B[i]);
}
after = now();
I had done the ^=
to discourage my compiler from optimizing the for-loop away. But I think that the ^=
somehow interacts strangely with the two xor functions. I changed my timing code to simply fill out an array of the xor results, and then do computation with that array outside of the timed code. And that fixed things.
Should I delete this question?
END EDIT
I defined two C++ functions as follows:
bool smart_xor(bool a, bool b) {
return a^b;
}
bool dumb_xor(bool a, bool b) {
return a?!b:b;
}
My timing tests indicate that dumb_xor()
is slightly faster (1.31ns vs 1.90ns when inlined, 1.92ns vs 2.21ns when not inlined). This puzzles me, as the ^
operator should be a single machine operation. I'm wondering if anyone has an explanation.
The assembly looks like this (when not inlined):
.file "xor.cpp"
.text
.p2align 4,,15
.globl _Z9smart_xorbb
.type _Z9smart_xorbb, @function
_Z9smart_xorbb:
.LFB0:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
movl %esi, %eax
xorl %edi, %eax
ret
.cfi_endproc
.LFE0:
.size _Z9smart_xorbb, .-_Z9smart_xorbb
.p2align 4,,15
.globl _Z8dumb_xorbb
.type _Z8dumb_xorbb, @function
_Z8dumb_xorbb:
.LFB1:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
movl %esi, %edx
movl %esi, %eax
xorl $1, %edx
testb %dil, %dil
cmovne %edx, %eax
ret
.cfi_endproc
.LFE1:
.size _Z8dumb_xorbb, .-_Z8dumb_xorbb
.ident "GCC: (Ubuntu 4.4.3-4ubuntu5) 4.4.3"
.section .note.GNU-stack,"",@progbits
I'm using g++ 4.4.3-4ubuntu5 on an Intel Xeon X5570. I compiled with -O3.
I don't think you benchmarked your code correctly.
We can see in the generated assembly that your smart_xor
function is:
movl %esi, %eax
xorl %edi, %eax
while your dumb_xor
function is:
movl %esi, %edx
movl %esi, %eax
xorl $1, %edx
testb %dil, %dil
cmovne %edx, %eax
So obviously, the first one will be faster.
If not, then you have benchmarking issues.
So you may want to tune your benchmarking code... And remember you'll need to run a lot of calls to have a good and meaningful average.
Given that your "dumb XOR" code is significantly longer (and most instructions are dependent on a previous one, so it won't run in parallel), I suspect that you have some sort of measurement error in your results.
The compiler will need to produce two instructions for the out-of-line version of "smart XOR" because the registers that the data comes in as is not the register to give the return result in, so the data has to move from EDI and ESI to EAX. In an inline version, the code should be able to use whatever register the data is in before the call, and if the code allows it, result stays in the register it came in as.
Calling a function is out-of-line is probably at least as long in execution time as the actual code in the function.
It would help if you showes your test-harness that you use for benchmarking too...
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With