I have a Compare() function that looks like this:
inline bool Compare(bool greater, int p1, int p2) {
    if (greater) return p1 >= p2;
    else return p1 <= p2;
}
I decided to optimize to avoid branching:
inline bool Compare2(bool greater, int p1, int p2) {
    bool ret[2] = {p1 <= p2, p1 >= p2};
    return ret[greater];
}
I then tested by doing this:
bool x = true;
const int M = 100000;  // const so that a[N], b[N], c[N] are standard C++ (not VLAs)
const int N = 100;
bool a[N];
int b[N];
int c[N];
for (int i = 0; i < N; ++i) {
    a[i] = rand() % 2;
    b[i] = rand() % 128;
    c[i] = rand() % 128;
}
// Timed the below loop with both Compare() and Compare2();
// x is consumed afterwards so the compiler cannot drop the work.
for (int j = 0; j < M; ++j) {
    for (int i = 0; i < N; ++i) {
        x ^= Compare(a[i], b[i], c[i]);
    }
}
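My full timing code is hairier than this fragment, so here is a minimal self-contained sketch of the same measurement (the chrono-based timing and the printout are assumptions, not my exact harness):
#include <chrono>
#include <cstdio>
#include <cstdlib>

inline bool Compare(bool greater, int p1, int p2) {
    if (greater) return p1 >= p2;
    else return p1 <= p2;
}

int main() {
    bool x = true;
    const int M = 100000;
    const int N = 100;
    bool a[N];
    int b[N];
    int c[N];
    for (int i = 0; i < N; ++i) {
        a[i] = rand() % 2;
        b[i] = rand() % 128;
        c[i] = rand() % 128;
    }

    const auto start = std::chrono::steady_clock::now();
    for (int j = 0; j < M; ++j)
        for (int i = 0; i < N; ++i)
            x ^= Compare(a[i], b[i], c[i]);
    const auto stop = std::chrono::steady_clock::now();

    const double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    // Printing x keeps the compiler from discarding the loop entirely.
    std::printf("avg %.2f ns/call (x=%d)\n", ns / (double(M) * N), int(x));
    return 0;
}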
The results:
Compare(): 3.14ns avg
Compare2(): 1.61ns avg
I would say case closed, avoid branching FTW. But for completeness, I replaced
a[i] = rand()%2;
with
a[i] = true;
and got the exact same measurement of ~3.14ns. Presumably there is no branching going on then, and the compiler is actually rewriting Compare() to avoid the if statement. But then, why is Compare2() faster?
Unfortunately, I am assembly-code-illiterate, otherwise I would have tried to answer this myself.
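For concreteness, here is a sketch of what an if-less rewrite of Compare() could look like at the source level (the bitwise form and the name CompareNoBranch are my own illustration, not something I verified the compiler actually produces):
// Both comparisons are evaluated unconditionally and combined with bitwise
// operators, so there is no if and no conditional jump in the source.
inline bool CompareNoBranch(bool greater, int p1, int p2) {
    return (greater & (p1 >= p2)) | (!greater & (p1 <= p2));
}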
EDIT: Below is some assembly:
_Z7Comparebii:
.LFB4:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
movl %edi, %eax
movl %esi, -8(%rbp)
movl %edx, -12(%rbp)
movb %al, -4(%rbp)
cmpb $0, -4(%rbp)
je .L2
movl -8(%rbp), %eax
cmpl -12(%rbp), %eax
setge %al
jmp .L3
.L2:
movl -8(%rbp), %eax
cmpl -12(%rbp), %eax
setle %al
.L3:
leave
ret
.cfi_endproc
.LFE4:
.size _Z7Comparebii, .-_Z7Comparebii
.section .text._Z8Compare2bii,"axG",@progbits,_Z8Compare2bii,comdat
.weak _Z8Compare2bii
.type _Z8Compare2bii, @function
_Z8Compare2bii:
.LFB5:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
movl %edi, %eax
movl %esi, -24(%rbp)
movl %edx, -28(%rbp)
movb %al, -20(%rbp)
movw $0, -16(%rbp)
movl -24(%rbp), %eax
cmpl -28(%rbp), %eax
setle %al
movb %al, -16(%rbp)
movl -24(%rbp), %eax
cmpl -28(%rbp), %eax
setge %al
movb %al, -15(%rbp)
movzbl -20(%rbp), %eax
cltq
movzbl -16(%rbp,%rax), %eax
leave
ret
.cfi_endproc
.LFE5:
.size _Z8Compare2bii, .-_Z8Compare2bii
.text
Now, the actual code that performs the test might be using inlined versions of the above two functions, so there is a possibility this is the wrong code to analyze. With that said, I see a conditional jump (je) in Compare(), so I think that means it is branching. If so, I guess this question becomes: why does the branch predictor not improve the performance of Compare() when I change a[i] from rand()%2 to true (or false, for that matter)?
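One experiment that could isolate the predictor (assuming a branch actually survives compilation) is to keep the 50/50 mix of values in a[] but make their order perfectly predictable, e.g. by sorting the array first; a sketch:
#include <algorithm>
#include <cstdlib>

int main() {
    const int N = 100;
    bool a[N];
    for (int i = 0; i < N; ++i) a[i] = rand() % 2;
    // Same mix of values, but grouped: all false first, then all true.
    // A real branch on a[i] should now be predicted almost perfectly,
    // while a branch that was compiled away would show no timing change.
    std::sort(a, a + N);
    // ... run the timed loop over a[] as before ...
    return 0;
}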
EDIT2: I replaced "branch prediction" with "branching" to make my post more sensible.
I think I figured most of this out.
When I posted the assembly for the functions in my OP edit, I noted that the inlined version might be different. I hadn't examined or posted the timing code because it was hairier, and because I thought that the process of inlining would not change whether or not branching takes place in Compare().
When I un-inlined the function and repeated my measurements, I got the following results:
Compare(): 7.18ns avg
Compare2(): 3.15ns avg
Then, when I replaced a[i] = rand()%2 with a[i] = false, I got the following:
Compare(): 2.59ns avg
Compare2(): 3.16ns avg
This demonstrates the gain from branch prediction. The fact that the a[i] substitution yielded no improvement originally shows that inlining removed the branch.
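For reference, one way to force the un-inlined version under GCC/Clang (a sketch of the mechanism; I haven't posted exactly how I un-inlined mine):
// noinline keeps the call, and therefore the branch inside it, in place.
// __attribute__((noinline)) is GCC/Clang-specific; MSVC spells it
// __declspec(noinline).
__attribute__((noinline)) bool CompareNoInline(bool greater, int p1, int p2) {
    if (greater) return p1 >= p2;
    return p1 <= p2;
}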
So the last piece of the mystery is why the inlined Compare2() outperforms the inlined Compare(). I suppose I could post the assembly for the timing code, but it seems plausible enough that some quirk in how the functions get inlined leads to this, so I'm content to end my investigation here. I will be replacing Compare() with Compare2() in my application.
Thanks for the many helpful comments.
EDIT: I should add that the probable reason Compare2() beats all the others is that the processor can perform both comparisons in parallel: the two results have no data dependency on each other, so they can execute independently. This was the intuition that led me to write the function the way I did. All the other variants essentially require two logically serial operations.
I wrote a C++ library called Celero designed to test just such optimizations and alternatives. (Shameless self promotion: https://github.com/DigitalInBlue/Celero)
I ran your cases using the following code:
#include <celero/Celero.h>  // Celero benchmarking framework
#include <cstdlib>          // rand()

CELERO_MAIN  // expands to a main() that runs the registered benchmarks

class StackOverflowFixture : public celero::TestFixture
{
public:
    StackOverflowFixture()
    {
    }

    // Baseline: measures harness and rand() overhead only.
    inline bool NoOp(bool greater, int p1, int p2)
    {
        return true;
    }

    // The original branching version.
    inline bool Compare(bool greater, int p1, int p2)
    {
        if (greater)
        {
            return p1 >= p2;
        }

        return p1 <= p2;
    }

    // Branch-free: compute both comparisons, then index into the results.
    inline bool Compare2(bool greater, int p1, int p2)
    {
        bool ret[2] = {p1 <= p2, p1 >= p2};
        return ret[greater];
    }

    // Branch-free: !(p1 <= p2) is (p1 > p2); the != selects between the two
    // comparisons, and | (p1 == p2) patches up the equality case.
    inline bool Compare3(bool greater, int p1, int p2)
    {
        return (!greater != !(p1 <= p2)) | (p1 == p2);
    }

    // Same idea with XOR: greater flips the sense of (p1 <= p2).
    inline bool Compare4(bool greater, int p1, int p2)
    {
        return (greater ^ (p1 <= p2)) | (p1 == p2);
    }
};
BASELINE_F(StackOverflow, Baseline, StackOverflowFixture, 100, 5000000)
{
    celero::DoNotOptimizeAway(NoOp(rand() % 2, rand(), rand()));
}

BENCHMARK_F(StackOverflow, Compare, StackOverflowFixture, 100, 5000000)
{
    celero::DoNotOptimizeAway(Compare(rand() % 2, rand(), rand()));
}

BENCHMARK_F(StackOverflow, Compare2, StackOverflowFixture, 100, 5000000)
{
    celero::DoNotOptimizeAway(Compare2(rand() % 2, rand(), rand()));
}

BENCHMARK_F(StackOverflow, Compare3, StackOverflowFixture, 100, 5000000)
{
    celero::DoNotOptimizeAway(Compare3(rand() % 2, rand(), rand()));
}

BENCHMARK_F(StackOverflow, Compare4, StackOverflowFixture, 100, 5000000)
{
    celero::DoNotOptimizeAway(Compare4(rand() % 2, rand(), rand()));
}
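As a quick sanity check that the variants are actually interchangeable, here is a small separate program (a sketch of mine, not part of the benchmark run above) that exhaustively verifies agreement with the original Compare() over a small input range:
#include <cassert>
#include <cstdio>

int main()
{
    StackOverflowFixture f;
    for (int g = 0; g <= 1; ++g)
        for (int p1 = -3; p1 <= 3; ++p1)
            for (int p2 = -3; p2 <= 3; ++p2)
            {
                const bool expected = f.Compare(g != 0, p1, p2);
                assert(f.Compare2(g != 0, p1, p2) == expected);
                assert(f.Compare3(g != 0, p1, p2) == expected);
                assert(f.Compare4(g != 0, p1, p2) == expected);
            }
    std::puts("all variants agree");
    return 0;
}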
The results are shown below:
[==========]
[ CELERO ]
[==========]
[ STAGE ] Baselining
[==========]
[ RUN ] StackOverflow.Baseline -- 100 samples, 5000000 calls per run.
[ DONE ] StackOverflow.Baseline (0.690499 sec) [5000000 calls in 690499 usec] [0.138100 us/call] [7241140.103027 calls/sec]
[==========]
[ STAGE ] Benchmarking
[==========]
[ RUN ] StackOverflow.Compare -- 100 samples, 5000000 calls per run.
[ DONE ] StackOverflow.Compare (0.782818 sec) [5000000 calls in 782818 usec] [0.156564 us/call] [6387180.672902 calls/sec]
[ BASELINE ] StackOverflow.Compare 1.133699
[ RUN ] StackOverflow.Compare2 -- 100 samples, 5000000 calls per run.
[ DONE ] StackOverflow.Compare2 (0.700767 sec) [5000000 calls in 700767 usec] [0.140153 us/call] [7135039.178500 calls/sec]
[ BASELINE ] StackOverflow.Compare2 1.014870
[ RUN ] StackOverflow.Compare3 -- 100 samples, 5000000 calls per run.
[ DONE ] StackOverflow.Compare3 (0.709471 sec) [5000000 calls in 709471 usec] [0.141894 us/call] [7047504.408214 calls/sec]
[ BASELINE ] StackOverflow.Compare3 1.027476
[ RUN ] StackOverflow.Compare4 -- 100 samples, 5000000 calls per run.
[ DONE ] StackOverflow.Compare4 (0.712940 sec) [5000000 calls in 712940 usec] [0.142588 us/call] [7013212.893091 calls/sec]
[ BASELINE ] StackOverflow.Compare4 1.032500
[==========]
[ COMPLETE ]
[==========]
Given this test, it looks like Compare2 is the best option for this micro-optimization.
EDIT:
Compare2 Assembly (The best case):
cmp r8d, r9d                       ; compare p1, p2
movzx eax, dl                      ; eax = greater (the array index)
setle BYTE PTR ret$[rsp]           ; ret[0] = (p1 <= p2)
cmp r8d, r9d                       ; second compare, independent of the first
setge BYTE PTR ret$[rsp+1]         ; ret[1] = (p1 >= p2)
movzx eax, BYTE PTR ret$[rsp+rax]  ; return ret[greater] -- no branches
Compare3 Assembly (The next-best case):
xor r11d, r11d       ; r11d = 0, reused as a zero constant below
cmp r8d, r9d         ; compare p1, p2
mov r10d, r11d       ; r10d = 0
setg r10b            ; r10b = (p1 > p2), i.e. !(p1 <= p2)
test dl, dl          ; test greater
mov ecx, r11d        ; ecx = 0
sete cl              ; cl = !greater
mov eax, r11d        ; eax = 0
cmp ecx, r10d        ; compare !greater with (p1 > p2)
setne al             ; al = (!greater != (p1 > p2))
cmp r8d, r9d         ; compare p1, p2 again
sete r11b            ; r11b = (p1 == p2)
or eax, r11d         ; result = al | (p1 == p2)