Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using bools in calculations to avoid branches

Here's a little micro-optimization curiosity that I came up with:

struct Timer {
    bool running{false};
    int ticks{0};

    void step_versionOne(int mStepSize) {
        if(running) ticks += mStepSize;
    }

    void step_versionTwo(int mStepSize) {
        ticks += mStepSize * static_cast<int>(running);
    }
};

It seems the two methods practically do the same thing. Does the second version avoid a branch (and consequently, is faster than the first version), or is any compiler able to do this kind of optimization with -O3?

like image 317
Vittorio Romeo Avatar asked Nov 23 '13 15:11

Vittorio Romeo


2 Answers

Yes, your trick allows to avoid branch and it makes it faster... sometimes.

I wrote benchmark that compares these solutions in various situations, along with my own:

ticks += mStepSize & -static_cast<int>(running)

My results are following:

Off:
 branch: 399949150
 mul:    399940271
 andneg: 277546678
On:
 branch: 204035423
 mul:    399937142
 andneg: 277581853
Pattern:
 branch: 327724860
 mul:    400010363
 andneg: 277551446
Random:
 branch: 915235440
 mul:    399916440
 andneg: 277537411

Off is when timers are turned off. In this cases solutions take about the same time.

On is when they are turned on. Branching solution two times faster.

Pattern is when they are in 100110 pattern. Performance is similar, but branching is a bit faster.

Random is when branch is unpredictable. In this case multiplications is more than 2 times faster.

In all cases my bit-hacking trick is fastest, except for On where branching wins.

Note that this benchmark is not necessarily representative for all compiler versions processors etc. Even small changes of benchmark can turn results upside down (for example if compiler can inline knowing mStepSize is 1 than multiplication can be actually fastest).

Code of the benchmark:

#include<array>
#include<iostream>
#include<chrono>

struct Timer {
    bool running{false};
    int ticks{0};

    void branch(int mStepSize) {
        if(running) ticks += mStepSize;
    }

    void mul(int mStepSize) {
        ticks += mStepSize * static_cast<int>(running);
    }

    void andneg(int mStepSize) {
        ticks += mStepSize & -static_cast<int>(running);
    }
};

void run(std::array<Timer, 256>& timers, int step) {
    auto start = std::chrono::steady_clock::now();
    for(int i = 0; i < 1000000; i++)
        for(auto& t : timers)
            t.branch(step);
    auto end = std::chrono::steady_clock::now();
    std::cout << "branch: " << (end - start).count() << std::endl;
    start = std::chrono::steady_clock::now();
    for(int i = 0; i < 1000000; i++)
        for(auto& t : timers)
            t.mul(step);
    end = std::chrono::steady_clock::now();
    std::cout << "mul:    " << (end - start).count() << std::endl;
    start = std::chrono::steady_clock::now();
    for(int i = 0; i < 1000000; i++)
        for(auto& t : timers)
            t.andneg(step);
    end = std::chrono::steady_clock::now();
    std::cout << "andneg: " << (end - start).count() << std::endl;
}

int main() {
    std::array<Timer, 256> timers;
    int step = rand() % 256;

    run(timers, step); // warm up
    std::cout << "Off:\n";
    run(timers, step);
    for(auto& t : timers)
        t.running = true;
    std::cout << "On:\n";
    run(timers, step);
    std::array<bool, 6> pattern = {1, 0, 0, 1, 1, 0};
    for(int i = 0; i < 256; i++)
        timers[i].running = pattern[i % 6];
    std::cout << "Pattern:\n";
    run(timers, step);
    for(auto& t : timers)
        t.running = rand()&1;
    std::cout << "Random:\n";
    run(timers, step);
    for(auto& t : timers)
        std::cout << t.ticks << ' ';
    return 0;
}
like image 125
zch Avatar answered Oct 01 '22 19:10

zch


Does the second version avoid a branch

if you compile your code to get assembler output, g++ -o test.s test.cpp -S, you'll find that a branch is indeed avoided in the second function.

and consequently, is faster than the first version

i ran each of your functions 2147483647 or INT_MAX number of times where in each iteration i randomly assigned a boolean value to running member of your Timer struct, using this code:

int main() {
    const int max = std::numeric_limits<int>::max();
    timestamp_t start, end, one, two;
    Timer t_one, t_two;
    double percent;

    srand(time(NULL));

    start = get_timestamp();
    for(int i = 0; i < max; ++i) {
        t_one.running = rand() % 2;
        t_one.step_versionOne(1);
    }
    end = get_timestamp();
    one = end - start;

    std::cout << "step_versionOne      = " << one << std::endl;

    start = get_timestamp();
    for(int i = 0; i < max; ++i) {
        t_two.running = rand() % 2;
        t_two.step_versionTwo(1);
    }
    end = get_timestamp();
    two = end - start;

    percent = (one - two) / static_cast<double>(one) * 100.0;

    std::cout << "step_versionTwo      = " << two << std::endl;
    std::cout << "step_one - step_two  = " << one - two << std::endl;
    std::cout << "one fast than two by = " << percent << std::endl;
 }

and these are the results i got:

step_versionOne      = 39738380
step_versionTwo      = 26047337
step_one - step_two  = 13691043
one fast than two by = 34.4529%

so yes, the second function is clearly faster, and by around 35%. note that the percentage increase in timed performance varied between 30 and 55 percent for a smaller number of iterations, whereas it seems to plateau at around 35% the longer it runs. this is might be due to sporadic execution of system tasks while the simulation is running, which become a lot less sporadic, i.e. consistent the longer you run the sim (although this is just my assumption, i have no idea if it's actually true)

all in all, nice question, i learned something today!


MORE:


of course, by randomly generating running, we are essentially rendering branch prediction useless in the first function, so the results above are not too surprising. however, if we decide to not alter running during loop iterations and instead leave it at its default value, in this case false, branch prediction will do its magic in the first function, and will actually be faster by almost 20% as these results suggest:

step_versionOne      = 6273942
step_versionTwo      = 7809508
step_two - step_one  = 1535566
two fast than one by = 19.6628

because running is constant throughout execution, notice that the simulation time is much shorter than it was with a randomly changing running - result of a compiler optimization likely.

why is the second function slower in this case? well, branch prediction will quickly realize that the condition in the first function is never met, and so will stop checking in the first place (as though if(running) ticks += mStepSize; isn't even there). on the other hand, the second function will still have to perform this instruction ticks += mStepSize * static_cast<int>(running); in every iteration, thus making the first function more efficient.

but what if we set running to true? well, the branch prediction will kick in again, however, this time, the first function will have to evaluate ticks += mStepSize; in every iteration; here the results when running{true}:

step_versionOne      = 7522095
step_versionTwo      = 7891948
step_two - step_one  = 369853
two fast than one by = 4.68646

notice that step_versionTwo takes a consistent amount of time whether running is constantly true or false. but it still takes longer than step_versionTwo, however marginally. well, this might be because i was too lazy to run it a lot of times to determine whether it's consistently faster or whether it was a one time fluke (results vary slightly every time you run it, since the OS has to run in the background and it's not always going to do the same thing). but if it is consistently faster, it might be because function two (ticks += mStepSize * static_cast<int>(running);) has an arithmetic op more than function one (ticks += mStepSize;).

finally, let's compile with an optimization - g++ -o test test.cpp -std=c++11 -O1 and let's revert running back to false and then check the results:

step_versionOne      = 704973
step_versionTwo      = 695052

more or less the same. the compiler will do its optimization pass, and realize running is always false and will thus, for all intents and purposes, remove the body of step_versionOne, so when you call it from the loop in main, it'll just call the function and return.

on the other hand, when optimizing the second function, it will realize that ticks += mStepSize * static_cast<int>(running); will always generate the same result, i.e. 0, so it won't bother executing that either.

all in all, if i'm correct (and if not, please correct me, i'm pretty new to this), all you'll get when calling both functions from the main loop is their overhead.

p.s. here's the result for the first case (running is randomly generated in every iteration) when compiled with an optimization.

step_versionOne      = 18868782
step_versionTwo      = 18812315
step_two - step_one  = 56467
one fast than two by = 0.299261
like image 22
stellarossa Avatar answered Oct 01 '22 20:10

stellarossa