 

Branch Prediction: Writing Code to Understand it; Getting Weird Results

I'm trying to get a good understanding of branch prediction by measuring the time to run loops with predictable branches vs. loops with random branches.

So I wrote a program that takes large arrays of 0's and 1's arranged in different orders (i.e. all 0's, repeating 0-1, all random), iterates through each array, and branches based on whether the current element is 0 or 1, doing time-wasting work in each branch.

I expected that the harder-to-guess arrays would take longer to run, since the branch predictor would guess wrong more often, and that the time delta between runs on two different arrays would remain the same regardless of the amount of time-wasting work.

However, as the amount of time-wasting work increased, the difference in time-to-run between the arrays increased, A LOT.

[Graph of the results (X-axis is amount of time-wasting work, Y-axis is time-to-run) - it makes no sense to me.]

Does anyone understand this behavior? The code I'm running is below:

#include <stdlib.h>
#include <time.h>
#include <chrono>
#include <stdio.h>
#include <iostream>
#include <string>   //std::string is used by TestMethod below
#include <vector>
using namespace std;
static const int s_iArrayLen = 999999;
static const int s_iMaxPipelineLen = 60;
static const int s_iNumTrials = 10;

int doWorkAndReturnMicrosecondsElapsed(int* vals, int pipelineLen){
        int* zeroNums = new int[pipelineLen];
        int* oneNums = new int[pipelineLen];
        for(int i = 0; i < pipelineLen; ++i)
                zeroNums[i] = oneNums[i] = 0;

        chrono::time_point<chrono::system_clock> start, end;
        start = chrono::system_clock::now();
        for(int i = 0; i < s_iArrayLen; ++i){
                if(vals[i] == 0){
                        for(int i = 0; i < pipelineLen; ++i)
                                ++zeroNums[i];
                }
                else{
                        for(int i = 0; i < pipelineLen; ++i)
                                ++oneNums[i];
                }
        }
        end = chrono::system_clock::now();
        int elapsedMicroseconds = (int)chrono::duration_cast<chrono::microseconds>(end-start).count();

        //This should never fire, it just exists to guarantee the compiler doesn't compile out our zeroNums/oneNums
        for(int i = 0; i < pipelineLen - 1; ++i)
                if(zeroNums[i] != zeroNums[i+1] || oneNums[i] != oneNums[i+1])
                        return -1;
        delete[] zeroNums;
        delete[] oneNums;
        return elapsedMicroseconds;
}

struct TestMethod{
        string name;
        void (*func)(int, int&);
        int* results;

        TestMethod(string _name, void (*_func)(int, int&)) { name = _name; func = _func; results = new int[s_iMaxPipelineLen]; }
};

int main(){
        srand( (unsigned int)time(nullptr) );

        vector<TestMethod> testMethods;
        testMethods.push_back(TestMethod("all-zero", [](int index, int& out) { out = 0; } ));
        testMethods.push_back(TestMethod("repeat-0-1", [](int index, int& out) { out = index % 2; } ));
        testMethods.push_back(TestMethod("repeat-0-0-0-1", [](int index, int& out) { out = (index % 4 == 0) ? 0 : 1; } ));
        testMethods.push_back(TestMethod("rand", [](int index, int& out) { out = rand() % 2; } ));

        int* vals = new int[s_iArrayLen];

        for(int currentPipelineLen = 0; currentPipelineLen < s_iMaxPipelineLen; ++currentPipelineLen){
                for(int currentMethod = 0; currentMethod < (int)testMethods.size(); ++currentMethod){
                        int resultsSum = 0;
                        for(int trialNum = 0; trialNum < s_iNumTrials; ++trialNum){
                                //Generate a new array...
                                for(int i = 0; i < s_iArrayLen; ++i)  
                                        testMethods[currentMethod].func(i, vals[i]);

                                //And record how long it takes
                                resultsSum += doWorkAndReturnMicrosecondsElapsed(vals, currentPipelineLen);
                        }

                        testMethods[currentMethod].results[currentPipelineLen] = (resultsSum / s_iNumTrials);
                }
        }

        cout << "\t";
        for(int i = 0; i < s_iMaxPipelineLen; ++i){
                cout << i << "\t";
        }
        cout << "\n";
        for (int i = 0; i < (int)testMethods.size(); ++i){
                cout << testMethods[i].name.c_str() << "\t";
                for(int j = 0; j < s_iMaxPipelineLen; ++j){
                        cout << testMethods[i].results[j] << "\t";
                }
                cout << "\n";
        }
        int end;
        cin >> end;
        delete[] vals;
}

Pastebin link: http://pastebin.com/F0JAu3uw

Asked Jan 04 '13 by Ben Walker



2 Answers

I think you may be measuring cache/memory performance more than branch prediction. Your inner 'work' loop is accessing an ever-increasing chunk of memory, which may explain the linear growth, the periodic behaviour, and so on.

I could be wrong, as I've not tried replicating your results, but if I were you I'd factor out memory accesses before timing other things. Perhaps sum one volatile variable into another, rather than working in an array.
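
Something along these lines, perhaps (untested, just to illustrate the idea - it keeps your branch pattern and per-branch work, but accumulates into volatile counters so cache behaviour drops out of the measurement; it assumes the same includes and using-directive as your code):

int timeBranchesOnly(int* vals, int arrayLen, int workLen){
    //Same shape as your doWorkAndReturnMicrosecondsElapsed, but the
    //"work" is done on two volatile counters instead of two arrays,
    //so almost no memory is touched per iteration.
    volatile int zeroAcc = 0;
    volatile int oneAcc = 0;

    chrono::time_point<chrono::system_clock> start, end;
    start = chrono::system_clock::now();
    for(int i = 0; i < arrayLen; ++i){
        if(vals[i] == 0){
            for(int j = 0; j < workLen; ++j)
                ++zeroAcc;
        }
        else{
            for(int j = 0; j < workLen; ++j)
                ++oneAcc;
        }
    }
    end = chrono::system_clock::now();
    return (int)chrono::duration_cast<chrono::microseconds>(end - start).count();
}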

Note also that, depending on the CPU, the branch prediction can be a lot smarter than just recording the last time a branch was taken - repeating patterns, for example, aren't as bad as random data.

Ok, a quick and dirty test I knocked up on my tea break which tried to mirror your own test method, but without thrashing the cache, looks like this:

[Graph of the quick test's results]

Is that more what you expected?

If I can spare any time later there's something else I want to try, as I've not really looked at what the compiler is doing...

Edit:

And, here's my final test - I recoded it in assembler to remove the loop branching, ensure an exact number of instructions in each path, etc.

More branch prediction results

I also added an extra case, of a 5-bit repeating pattern. It seems pretty hard to upset the branch predictor on my ageing Xeon.
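
In your original C++ framework, an extra case like that might look roughly like this (the particular bits here are arbitrary, not necessarily the pattern I used):

//Hypothetical extra entry for the testMethods list: a pattern that
//repeats every 5 elements
testMethods.push_back(TestMethod("repeat-5-bit", [](int index, int& out) {
    static const int pattern[5] = { 1, 0, 1, 1, 0 };
    out = pattern[index % 5];
} ));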

Answered by JasonD


In addition to what JasonD pointed out, I would also like to note that there is a condition inside the for loop which may affect branch prediction:

if(vals[i] == 0)
{
    for(int i = 0; i < pipelineLen; ++i)
        ++zeroNums[i];
}

i < pipelineLen; is a condition, just like your ifs. Of course the compiler may unroll this loop; however, pipelineLen is an argument passed to the function, so it probably does not.
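
One way to check how much those inner-loop branches contribute is to make pipelineLen a compile-time constant, so the compiler is free to fully unroll the inner loops. A rough, untested sketch:

//Sketch only: with the trip count known at compile time, the compiler
//can fully unroll the inner "work" loops and drop their loop branches.
template<int PipelineLen>
void doWorkUnrolled(int* vals, int arrayLen, int* zeroNums, int* oneNums){
    for(int i = 0; i < arrayLen; ++i){
        if(vals[i] == 0){
            for(int j = 0; j < PipelineLen; ++j)
                ++zeroNums[j];
        }
        else{
            for(int j = 0; j < PipelineLen; ++j)
                ++oneNums[j];
        }
    }
}

//e.g. doWorkUnrolled<16>(vals, s_iArrayLen, zeroNums, oneNums);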

I'm not sure if this can explain the wavy pattern of your results, but:

Since the BTB is only 16 entries long in the Pentium 4 processor, the prediction will eventually fail for loops that are longer than 16 iterations. This limitation can be avoided by unrolling a loop until it is only 16 iterations long. When this is done, a loop conditional will always fit into the BTB, and a branch misprediction will not occur on loop exit. The following is an example of loop unrolling:

Read full article: http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts
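
The article's own code sample isn't reproduced above, but as a rough illustration of the idea (mine, not Intel's), unrolling by 4 makes the loop conditional execute a quarter as often:

//Illustration only: a 64-iteration loop unrolled by 4, so its loop
//branch is taken 16 times instead of 64.
int sumUnrolled(const int* data){
    int total = 0;
    for(int i = 0; i < 64; i += 4){
        total += data[i];
        total += data[i + 1];
        total += data[i + 2];
        total += data[i + 3];
    }
    return total;
}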

So your loops are not only measuring memory throughput; they are also affecting the BTB.

If you passed a 0-1 pattern in your list but then executed the inner for loop with pipelineLen = 2, your BTB will be filled with something like 0-1-1-0 - 1-1-1-0 - 0-1-1-0 - 1-1-1-0, and then it will start to overlap, so this can indeed explain the wavy pattern of your results (some overlaps will be more harmful than others).

Take this as an example of what may happen rather than a literal explanation. Your CPU may have a much more sophisticated branch prediction architecture.

Answered by mip