Let A be an array that contains an odd number of zeros and ones. If n is the size of A, then A is constructed such that the first ceil(n/2) elements are 0 and the remaining elements are 1.

So if n = 9, A would look like this:

0,0,0,0,0,1,1,1,1
The goal is to find the sum of the 1s in the array, and we do this using the following function:
s = 0;

void test1(int curIndex){
    //A is 0,0,0,...,0,1,1,1,1,1...,1
    if(curIndex == ceil(n/2)) return;
    if(A[curIndex] == 1) return;
    test1(curIndex+1);
    test1(size-curIndex-1);
    s += A[curIndex+1] + A[size-curIndex-1];
}
This function is admittedly silly for the problem at hand, but it is a stand-in for a different function that I want to keep in this shape and that produces the same number of branch mispredictions.
Here is the entire code of the experiment:
#include <iostream>
#include <cstdlib>
using namespace std;

int size;
int *A;
int half;
int s;

void test1(int curIndex){
    //A is 0,0,0,...,0,1,1,1,1,1...,1
    if(curIndex == half) return;
    if(A[curIndex] == 1) return;
    test1(curIndex+1);
    test1(size - curIndex - 1);
    s += A[curIndex+1] + A[size-curIndex-1];
}

int main(int argc, char* argv[]){
    if(argc!=2){
        cout<<"type ./executable size{odd integer}"<<endl;
        return 1;
    }
    size = atoi(argv[1]);
    if(size%2!=1){
        cout<<"size must be an odd number"<<endl;
        return 1;
    }
    A = new int[size];
    half = size/2;
    int i;
    for(i=0;i<=half;i++){
        A[i] = 0;
    }
    for(i=half+1;i<size;i++){
        A[i] = 1;
    }
    for(i=0;i<100;i++) {
        test1(0);
    }
    cout<<s<<endl;
    return 0;
}
Compile by typing g++ -O3 -std=c++11 file.cpp and run by typing ./executable size{odd integer}.
I am using an Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz with 8 GB of RAM, L1 cache 256 KB, L2 cache 1 MB, L3 cache 6 MB.
Running perf stat -B -e branches,branch-misses ./cachetests 111111 gives me the following:
Performance counter stats for './cachetests 111111':

        32,639,932      branches
         1,404,836      branch-misses            #    4.30% of all branches

       0.060349641 seconds time elapsed
If I remove the line

s += A[curIndex+1] + A[size-curIndex-1];

I get the following output from perf:
Performance counter stats for './cachetests 111111':

        24,079,109      branches
            39,078      branch-misses            #    0.16% of all branches

       0.027679521 seconds time elapsed
What does that line have to do with branch prediction when it's not even an if statement?
The way I see it, in the first ceil(n/2) - 1 calls of test1(), both if statements will be false. In the ceil(n/2)-th call, if(curIndex == ceil(n/2)) will be true. In the remaining n - ceil(n/2) calls, the first statement will be false and the second statement will be true. Why does Intel fail to predict such a simple behavior?
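To sanity-check that reasoning, here is a small instrumented variant of test1() (the counters and the name test1_counted are made up for illustration; this is not the program I measured with perf) that counts how often each condition fires:

// Hypothetical instrumentation, not the measured program: counts how the two
// conditions inside test1() evaluate for the half-zeroes/half-ones input.
#include <iostream>

int n, half, s;
int *A;
long firstTrue, secondTrue, bothFalse;

void test1_counted(int curIndex){
    if(curIndex == half){ firstTrue++; return; }
    if(A[curIndex] == 1){ secondTrue++; return; }
    bothFalse++;
    test1_counted(curIndex+1);
    test1_counted(n - curIndex - 1);
    s += A[curIndex+1] + A[n-curIndex-1];
}

int main(){
    n = 111111;
    half = n/2;
    A = new int[n];
    for(int i = 0; i < n; i++) A[i] = (i > half) ? 1 : 0;   // zeros then ones
    test1_counted(0);
    std::cout << "first if true:  " << firstTrue  << "\n"   // expected: 1
              << "second if true: " << secondTrue << "\n"   // expected: 55555
              << "both false:     " << bothFalse  << "\n";  // expected: 55555
    return 0;
}

For n = 111111 this should report the first if as true once, the second if as true 55555 times, and both conditions false 55555 times, matching the description above.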
Now let's look at a second case. Suppose that A now has alternating zeros and ones. We will always start from 0. So if n = 9, A will look like this:

0,1,0,1,0,1,0,1,0
The function we are going to use is the following:
void test2(int curIndex){
    //A is 0,1,0,1,0,1,0,1,....
    if(curIndex == size-1) return;
    if(A[curIndex] == 1) return;
    test2(curIndex+1);
    test2(curIndex+2);
    s += A[curIndex+1] + A[curIndex+2];
}
And here is the entire code of the experiment:
#include <iostream>
#include <cstdlib>
using namespace std;

int size;
int *A;
int s;

void test2(int curIndex){
    //A is 0,1,0,1,0,1,0,1,....
    if(curIndex == size-1) return;
    if(A[curIndex] == 1) return;
    test2(curIndex+1);
    test2(curIndex+2);
    s += A[curIndex+1] + A[curIndex+2];
}

int main(int argc, char* argv[]){
    if(argc!=2){
        cout<<"type ./executable size{odd integer}"<<endl;
        return 1;
    }
    size = atoi(argv[1]);
    if(size%2!=1){
        cout<<"size must be an odd number"<<endl;
        return 1;
    }
    A = new int[size];
    int i;
    for(i=0;i<size;i++){
        if(i%2==0){
            A[i] = 0;
        }
        else{
            A[i] = 1;
        }
    }
    for(i=0;i<100;i++) {
        test2(0);
    }
    cout<<s<<endl;
    return 0;
}
I run perf using the same commands as before:
Performance counter stats for './cachetests2 111111':

        28,560,183      branches
            54,204      branch-misses            #    0.19% of all branches

       0.037134196 seconds time elapsed
And removing that line again improved things a little bit:
Performance counter stats for './cachetests2 111111':

        28,419,557      branches
            16,636      branch-misses            #    0.06% of all branches

       0.009977772 seconds time elapsed
Now if we analyse the function, if(curIndex == size-1) will be false n-1 times, and if(A[curIndex] == 1) will alternate between true and false.
As I see it, both functions should be easy to predict, yet this is not the case for the first function. At the same time, I am not sure what is happening with that line and why it plays a role in improving branch behavior.
I believe the most common way to avoid branching is to leverage bit parallelism to reduce the total number of jumps present in your code. The longer the basic blocks, the less often the pipeline is flushed.
The CPU uses a pipeline to execute instructions, which means that while a previous instruction is being executed at some stage (for example, reading values from registers), the next instruction is being processed at the same time at another stage (for example, the decoding stage).
Once a branch misprediction is discovered, the core is able to restart decoding as soon as the correct path is known, at the same time that the out-of-order machine is clearing out uops from the wrongly speculated path.
Branchless programming is a technique that eliminates branches (if, switch, and other conditional statements) from a program, although it is less relevant these days given how powerful modern systems are and how widely interpreted (especially dynamically typed) languages are used.
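To make the idea concrete (this is just a generic sketch, unrelated to the recursive code above, with made-up function names), here is how a data-dependent if can be traded for mask arithmetic so that the loop body contains no conditional jump:

// Branchy version: whether the add happens depends on unpredictable data.
int sum_less_branchy(const int *a, int n, int limit){
    int s = 0;
    for(int i = 0; i < n; i++){
        if(a[i] < limit) s += a[i];
    }
    return s;
}

// Branchless sketch: turn the comparison into a 0 / all-ones mask and use it
// to select the addend, so the loop body has no conditional jump.
int sum_less_branchless(const int *a, int n, int limit){
    int s = 0;
    for(int i = 0; i < n; i++){
        int mask = -(int)(a[i] < limit);   // 0 or -1 (all bits set)
        s += a[i] & mask;                  // adds a[i] or 0
    }
    return s;
}

Whether this is actually faster depends on the data and on the compiler, which may already emit a conditional move for the branchy version, but it illustrates the trade of an unpredictable branch for a little extra straight-line work.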
Here are my thoughts on this after staring at it for a while. First of all, the issue is easily reproducible with -O2, so it's better to use that as a reference, as it generates simple non-unrolled code that is easy to analyse. The problem with -O3 is essentially the same; it's just a bit less obvious.
So, for the first case (half-zeros with half-ones pattern) the compiler generates this code:
0000000000400a80 <_Z5test1i>:
  400a80:  55                      push   %rbp
  400a81:  53                      push   %rbx
  400a82:  89 fb                   mov    %edi,%ebx
  400a84:  48 83 ec 08             sub    $0x8,%rsp
  400a88:  3b 3d 0e 07 20 00       cmp    0x20070e(%rip),%edi        # 60119c <half>
  400a8e:  74 4f                   je     400adf <_Z5test1i+0x5f>
  400a90:  48 8b 15 09 07 20 00    mov    0x200709(%rip),%rdx        # 6011a0 <A>
  400a97:  48 63 c7                movslq %edi,%rax
  400a9a:  48 8d 2c 85 00 00 00    lea    0x0(,%rax,4),%rbp
  400aa1:  00
  400aa2:  83 3c 82 01             cmpl   $0x1,(%rdx,%rax,4)
  400aa6:  74 37                   je     400adf <_Z5test1i+0x5f>
  400aa8:  8d 7f 01                lea    0x1(%rdi),%edi
  400aab:  e8 d0 ff ff ff          callq  400a80 <_Z5test1i>
  400ab0:  89 df                   mov    %ebx,%edi
  400ab2:  f7 d7                   not    %edi
  400ab4:  03 3d ee 06 20 00       add    0x2006ee(%rip),%edi        # 6011a8 <size>
  400aba:  e8 c1 ff ff ff          callq  400a80 <_Z5test1i>
  400abf:  8b 05 e3 06 20 00       mov    0x2006e3(%rip),%eax        # 6011a8 <size>
  400ac5:  48 8b 15 d4 06 20 00    mov    0x2006d4(%rip),%rdx        # 6011a0 <A>
  400acc:  29 d8                   sub    %ebx,%eax
  400ace:  48 63 c8                movslq %eax,%rcx
  400ad1:  8b 44 2a 04             mov    0x4(%rdx,%rbp,1),%eax
  400ad5:  03 44 8a fc             add    -0x4(%rdx,%rcx,4),%eax
  400ad9:  01 05 b9 06 20 00       add    %eax,0x2006b9(%rip)        # 601198 <s>
  400adf:  48 83 c4 08             add    $0x8,%rsp
  400ae3:  5b                      pop    %rbx
  400ae4:  5d                      pop    %rbp
  400ae5:  c3                      retq
  400ae6:  66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  400aed:  00 00 00
Very simple, kind of what you would expect: two conditional branches and two calls. It gives us these (or similar) statistics on a Core 2 Duo T6570, an AMD Phenom II X4 925 and a Core i7-4770:
$ perf stat -B -e branches,branch-misses ./a.out 111111
5555500

 Performance counter stats for './a.out 111111':

        45,216,754      branches
         5,588,484      branch-misses            #   12.36% of all branches

       0.098535791 seconds time elapsed
If you make this change, moving the assignment before the recursive calls:
--- file.cpp.orig	2016-09-22 22:59:20.744678438 +0300
+++ file.cpp	2016-09-22 22:59:36.492583925 +0300
@@ -15,10 +15,10 @@
     if(curIndex == half) return;
     if(A[curIndex] == 1) return;
+    s += A[curIndex+1] + A[size-curIndex-1];
     test1(curIndex+1);
     test1(size - curIndex - 1);
-    s += A[curIndex+1] + A[size-curIndex-1];
 }
The picture changes:
$ perf stat -B -e branches,branch-misses ./a.out 111111
5555500

 Performance counter stats for './a.out 111111':

        39,495,804      branches
            54,430      branch-misses            #    0.14% of all branches

       0.039522259 seconds time elapsed
And yes, as was already noted, this is directly related to tail recursion optimisation, because if you compile the patched code with -fno-optimize-sibling-calls you get the same "bad" results. So let's look at what we have in assembly with tail call optimization:
0000000000400a80 <_Z5test1i>:
  400a80:  3b 3d 16 07 20 00       cmp    0x200716(%rip),%edi        # 60119c <half>
  400a86:  53                      push   %rbx
  400a87:  89 fb                   mov    %edi,%ebx
  400a89:  74 5f                   je     400aea <_Z5test1i+0x6a>
  400a8b:  48 8b 05 0e 07 20 00    mov    0x20070e(%rip),%rax        # 6011a0 <A>
  400a92:  48 63 d7                movslq %edi,%rdx
  400a95:  83 3c 90 01             cmpl   $0x1,(%rax,%rdx,4)
  400a99:  74 4f                   je     400aea <_Z5test1i+0x6a>
  400a9b:  8b 0d 07 07 20 00       mov    0x200707(%rip),%ecx        # 6011a8 <size>
  400aa1:  eb 15                   jmp    400ab8 <_Z5test1i+0x38>
  400aa3:  0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  400aa8:  48 8b 05 f1 06 20 00    mov    0x2006f1(%rip),%rax        # 6011a0 <A>
  400aaf:  48 63 d3                movslq %ebx,%rdx
  400ab2:  83 3c 90 01             cmpl   $0x1,(%rax,%rdx,4)
  400ab6:  74 32                   je     400aea <_Z5test1i+0x6a>
  400ab8:  29 d9                   sub    %ebx,%ecx
  400aba:  8d 7b 01                lea    0x1(%rbx),%edi
  400abd:  8b 54 90 04             mov    0x4(%rax,%rdx,4),%edx
  400ac1:  48 63 c9                movslq %ecx,%rcx
  400ac4:  03 54 88 fc             add    -0x4(%rax,%rcx,4),%edx
  400ac8:  01 15 ca 06 20 00       add    %edx,0x2006ca(%rip)        # 601198 <s>
  400ace:  e8 ad ff ff ff          callq  400a80 <_Z5test1i>
  400ad3:  8b 0d cf 06 20 00       mov    0x2006cf(%rip),%ecx        # 6011a8 <size>
  400ad9:  89 c8                   mov    %ecx,%eax
  400adb:  29 d8                   sub    %ebx,%eax
  400add:  89 c3                   mov    %eax,%ebx
  400adf:  83 eb 01                sub    $0x1,%ebx
  400ae2:  39 1d b4 06 20 00       cmp    %ebx,0x2006b4(%rip)        # 60119c <half>
  400ae8:  75 be                   jne    400aa8 <_Z5test1i+0x28>
  400aea:  5b                      pop    %rbx
  400aeb:  c3                      retq
  400aec:  0f 1f 40 00             nopl   0x0(%rax)
It has four conditional branches with one call. So let's analyse the data we've got so far.
First of all, what is a branching instruction from the processor's perspective? It's any of call, ret, j* (including direct jmp) and loop. call and jmp are a bit unintuitive, but they are crucial to count things correctly.
Overall, we expect this function to be called 11111100 times (once for each element, over 100 iterations), which is roughly 11M. In the non-tail-call-optimized version we see about 45M branches; the initialization in main() contributes just 111K, and all the other things are minor, so the main contribution to this number comes from our function. Our function is call-ed, it evaluates the first je, which is true in all cases except one, then it evaluates the second je, which is true half of the time, and then it either calls itself recursively (but we've already counted that the function is invoked 11M times) or returns (as it does after the recursive calls). So that's 4 branching instructions per call, times 11M calls, which is exactly the number we see. Out of these, around 5.5M branches are missed, which suggests that these misses all come from one mispredicted instruction: either something that's evaluated 11M times and missed around 50% of the time, or something that's evaluated half of the time and always missed.
What do we have in the tail-call-optimized version? The function is called around 5.5M times, but now each invocation incurs one call, two branches initially (the first one is true in all cases except one and the second one is always false because of our data), then a jmp, then a call (but we've already counted that we have 5.5M calls), then a branch at 400ae8 and a branch at 400ab6 (always true because of our data), then a return. So, on average, that's four conditional branches, one unconditional jump, a call and one indirect branch (the return from the function); 5.5M times 7 gives us an overall count of around 39M branches, exactly as we see in the perf output.
What we know is that the processor has no problem at all predicting things in a flow with one function call (even though this version has more conditional branches), but it does have problems with two function calls. This suggests that the problem lies in the returns from the function.
Unfortunately, we know very little about the details of how exactly the branch predictors of our modern processors work. The best analysis that I could find is this one, and it suggests that the processors have a return stack buffer of around 16 entries. If we return to our data with this finding at hand, things start to clarify a bit.
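To make the return-stack-buffer idea concrete, here is a toy model of such a buffer; the 16-entry size and the simple wrap-around overwrite policy are assumptions taken from that analysis, not a description of the real hardware:

// Toy model of a return stack buffer: a small circular stack of predicted
// return addresses. Once more than RSB_SIZE calls are outstanding, older
// entries are overwritten and the corresponding returns get mispredicted.
#include <cstdint>
#include <cstdio>

constexpr int RSB_SIZE = 16;            // assumed size, per the linked analysis

struct ReturnStackBuffer {
    uint64_t entry[RSB_SIZE] = {};
    int top = 0;                        // index of the next free slot (wraps)

    void push(uint64_t return_addr) {   // performed on every call
        entry[top % RSB_SIZE] = return_addr;
        top++;
    }
    uint64_t predict_return() {         // performed on every ret
        top--;
        return entry[top % RSB_SIZE];   // may have been overwritten if the
    }                                   // call depth exceeded RSB_SIZE
};

int main(){
    ReturnStackBuffer rsb;
    int depth = 40, misses = 0;
    // Simulate a call chain 40 levels deep, then unwind it.
    for (int d = 0; d < depth; d++) rsb.push(d);
    for (int d = depth - 1; d >= 0; d--)
        if (rsb.predict_return() != (uint64_t)d) misses++;
    printf("mispredicted returns: %d of %d\n", misses, depth);  // 24 of 40
    return 0;
}

In this model, once the call chain gets deeper than the buffer, every return below the most recent 16 is mispredicted, which is exactly the situation the deep recursion into test1(curIndex+1) creates.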
When you have the half-zeroes/half-ones pattern, you recurse very deeply into test1(curIndex+1), but then you start returning back and calling test1(size-curIndex-1). That recursion is never deeper than one call, so the returns are predicted perfectly for it. But remember that we're now 55555 invocations deep and the processor only remembers the last 16, so it's not surprising that it can't guess our returns starting from 55539 levels deep; it's more surprising that it can do so with the tail-call-optimized version.
Actually, the behaviour of the tail-call-optimized version suggests that, lacking any other information about returns, the processor just assumes that the right one is the last one seen. This is also borne out by the behaviour of the non-tail-call-optimized version: it goes 55555 calls deep into test1(curIndex+1), and then upon return it always goes one level deep into test1(size-curIndex-1). So when we're back up from 55555 deep to 55539 deep (or whatever your processor's return buffer size is), it calls into test1(size-curIndex-1), returns from that, and has absolutely no information about the next return, so it assumes that we're returning to the last seen address (which is the return address of test1(size-curIndex-1)) and is obviously wrong. 55539 times wrong. With 100 cycles of the function, that's exactly the 5.5M branch prediction misses we see.
Now let's get to your alternating pattern and the code for that. This code is actually very different if you analyse how it goes into the depth. Here test2(curIndex+1) always returns immediately and test2(curIndex+2) always goes deeper. So the returns from test2(curIndex+1) are always predicted perfectly (they just don't go deep enough), and when we finish our recursion into test2(curIndex+2), it always returns to the same point, all 55555 times, so the processor has no problems with that.
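One way to see this difference in shape is to measure how deep the call tree goes behind each of the two call sites. The sketch below is my own instrumentation (the depth-returning variant is not part of the original program):

#include <iostream>
#include <algorithm>
using namespace std;

int n;                    // array size ("size" in the original code)
int *A;
int maxFirst, maxSecond;  // deepest subtree seen behind each call site

// Returns the depth of the call tree rooted at this invocation of the
// test2-like recursion, and records the deepest subtree behind each call.
int test2_depth(int curIndex){
    if(curIndex == n-1) return 1;
    if(A[curIndex] == 1) return 1;
    int d1 = test2_depth(curIndex+1);      // the test2(curIndex+1) site
    int d2 = test2_depth(curIndex+2);      // the test2(curIndex+2) site
    maxFirst  = max(maxFirst, d1);
    maxSecond = max(maxSecond, d2);
    return 1 + max(d1, d2);
}

int main(){
    n = 111111;
    A = new int[n];
    for(int i = 0; i < n; i++) A[i] = i % 2;   // alternating 0,1,0,1,...
    test2_depth(0);
    cout << "deepest subtree via curIndex+1: " << maxFirst  << endl   // 1
         << "deepest subtree via curIndex+2: " << maxSecond << endl;  // ~55555
    return 0;
}

With the alternating input the first call site never produces a subtree deeper than one level, while the second one carries the whole chain of roughly 55555 calls, which matches the return-prediction behaviour described above.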
This can further be proven by this little change to your original half-zeroes with half-ones code:
--- file.cpp.orig	2016-09-23 11:00:26.917977032 +0300
+++ file.cpp	2016-09-23 11:00:31.946027451 +0300
@@ -15,8 +15,8 @@
     if(curIndex == half) return;
     if(A[curIndex] == 1) return;
-    test1(curIndex+1);
     test1(size - curIndex - 1);
+    test1(curIndex+1);
     s += A[curIndex+1] + A[size-curIndex-1];
So now the code generated is still not tail-call optimized (assembly-wise it's very similar to the original), but you get something like this in the perf output:
$ perf stat -B -e branches,branch-misses ./a.out 111111
5555500

 Performance counter stats for './a.out 111111':

        45 308 579      branches
            75 927      branch-misses            #    0,17% of all branches

       0,026271402 seconds time elapsed
As expected, now our first call always returns immediately and the second call goes 55555 calls deep and then always returns to the same point.
Now, with that settled, let me show you something up my sleeve. On one system, a Core i5-5200U, the non-tail-call-optimized original half-zeroes/half-ones version shows these results:
$ perf stat -B -e branches,branch-misses ./a.out 111111
5555500

 Performance counter stats for './a.out 111111':

        45 331 670      branches
            16 349      branch-misses            #    0,04% of all branches

       0,043351547 seconds time elapsed
So, apparently, Broadwell can handle this pattern easily, which brings us back to the question of how much we actually know about the branch prediction logic of our modern processors.
Removing the line s += A[curIndex+1] + A[size-curIndex-1]; enables tail recursion optimization. This optimization can only happen when the recursive call is the last statement of the function.
https://en.wikipedia.org/wiki/Tail_call
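To illustrate the rule with a minimal sketch (not the exact code from the question): in the first function below some work remains after the recursive call, so the compiler has to keep a real call; in the second, the recursive call is the last thing the function does, so it can be turned into a jump:

int s;
int *A;

// Not a tail call: work remains after the recursive call returns, so the
// caller's frame (and a predictable return address) is still needed.
void not_tail(int i){
    if(i == 0) return;
    not_tail(i - 1);
    s += A[i];              // work after the call blocks the optimization
}

// Tail call: nothing happens after the recursive call, so the compiler may
// replace call+ret with a plain jump (effectively a loop).
void tail(int i){
    if(i == 0) return;
    s += A[i];              // work moved before the call
    tail(i - 1);
}

In GCC this transformation is controlled by -foptimize-sibling-calls, which is enabled at -O2 and above; that is why compiling with -fno-optimize-sibling-calls brings the "bad" behaviour back, as noted above.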