Here's a little micro-optimization curiosity that I came up with:
struct Timer {
    bool running{false};
    int ticks{0};

    void step_versionOne(int mStepSize) {
        if(running) ticks += mStepSize;
    }

    void step_versionTwo(int mStepSize) {
        ticks += mStepSize * static_cast<int>(running);
    }
};
The two methods seem to do practically the same thing. Does the second version avoid a branch (and is it consequently faster than the first version), or can any compiler do this kind of optimization with -O3?
Yes, your trick avoids the branch, and it does make the code faster... sometimes.
I wrote a benchmark that compares these solutions in various situations, along with my own:
ticks += mStepSize & -static_cast<int>(running);
My results are the following:
Off:
branch: 399949150
mul: 399940271
andneg: 277546678
On:
branch: 204035423
mul: 399937142
andneg: 277581853
Pattern:
branch: 327724860
mul: 400010363
andneg: 277551446
Random:
branch: 915235440
mul: 399916440
andneg: 277537411
Off is when the timers are turned off. In this case the solutions take about the same time.
On is when they are turned on. The branching solution is about twice as fast.
Pattern is when they follow a 100110 pattern. Performance is similar, but branching is a bit faster.
Random is when the branch is unpredictable. In this case multiplication is more than 2 times faster.
In all cases my bit-hacking trick is fastest, except for On, where branching wins.
Note that this benchmark is not necessarily representative of all compiler versions, processors, etc. Even small changes to the benchmark can turn the results upside down (for example, if the compiler can inline knowing that mStepSize is 1, then multiplication can actually be fastest).
Code of the benchmark:
#include <array>
#include <chrono>
#include <cstdlib>  // for rand()
#include <iostream>

struct Timer {
    bool running{false};
    int ticks{0};

    void branch(int mStepSize) {
        if(running) ticks += mStepSize;
    }
    void mul(int mStepSize) {
        ticks += mStepSize * static_cast<int>(running);
    }
    void andneg(int mStepSize) {
        ticks += mStepSize & -static_cast<int>(running);
    }
};

void run(std::array<Timer, 256>& timers, int step) {
    auto start = std::chrono::steady_clock::now();
    for(int i = 0; i < 1000000; i++)
        for(auto& t : timers)
            t.branch(step);
    auto end = std::chrono::steady_clock::now();
    std::cout << "branch: " << (end - start).count() << std::endl;

    start = std::chrono::steady_clock::now();
    for(int i = 0; i < 1000000; i++)
        for(auto& t : timers)
            t.mul(step);
    end = std::chrono::steady_clock::now();
    std::cout << "mul: " << (end - start).count() << std::endl;

    start = std::chrono::steady_clock::now();
    for(int i = 0; i < 1000000; i++)
        for(auto& t : timers)
            t.andneg(step);
    end = std::chrono::steady_clock::now();
    std::cout << "andneg: " << (end - start).count() << std::endl;
}

int main() {
    std::array<Timer, 256> timers;
    int step = rand() % 256;
    run(timers, step); // warm up
    std::cout << "Off:\n";
    run(timers, step);
    for(auto& t : timers)
        t.running = true;
    std::cout << "On:\n";
    run(timers, step);
    std::array<bool, 6> pattern = {1, 0, 0, 1, 1, 0};
    for(int i = 0; i < 256; i++)
        timers[i].running = pattern[i % 6];
    std::cout << "Pattern:\n";
    run(timers, step);
    for(auto& t : timers)
        t.running = rand() & 1;
    std::cout << "Random:\n";
    run(timers, step);
    for(auto& t : timers)
        std::cout << t.ticks << ' ';
    return 0;
}
Does the second version avoid a branch
If you compile your code to get assembler output, g++ -S test.cpp -o test.s, you'll find that the branch is indeed avoided in the second function.
and consequently, is faster than the first version
I ran each of your functions 2147483647 (INT_MAX) times, where in each iteration I randomly assigned a boolean value to the running member of your Timer struct, using this code:
#include <chrono>
#include <cstdlib>  // srand, rand
#include <ctime>    // time
#include <iostream>
#include <limits>

// Stand-ins for the timing helpers elided from the original post;
// the Timer struct is assumed to be defined as in the question.
using timestamp_t = long long;
static timestamp_t get_timestamp() {
    return std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::steady_clock::now().time_since_epoch()).count();
}

int main() {
    const int max = std::numeric_limits<int>::max();
    timestamp_t start, end, one, two;
    Timer t_one, t_two;
    double percent;
    srand(time(NULL));

    start = get_timestamp();
    for(int i = 0; i < max; ++i) {
        t_one.running = rand() % 2;
        t_one.step_versionOne(1);
    }
    end = get_timestamp();
    one = end - start;
    std::cout << "step_versionOne = " << one << std::endl;

    start = get_timestamp();
    for(int i = 0; i < max; ++i) {
        t_two.running = rand() % 2;
        t_two.step_versionTwo(1);
    }
    end = get_timestamp();
    two = end - start;

    percent = (one - two) / static_cast<double>(one) * 100.0;
    std::cout << "step_versionTwo = " << two << std::endl;
    std::cout << "step_one - step_two = " << one - two << std::endl;
    std::cout << "one fast than two by = " << percent << std::endl;
}
And these are the results I got:
step_versionOne = 39738380
step_versionTwo = 26047337
step_one - step_two = 13691043
one fast than two by = 34.4529%
So yes, the second function is clearly faster, by around 35%. Note that the percentage improvement varied between 30 and 55 percent for smaller iteration counts, whereas it seems to plateau at around 35% the longer the test runs. This might be due to sporadic execution of system tasks while the simulation is running, which becomes more consistent the longer you run it (although this is just my assumption; I have no idea if it's actually true).
All in all, nice question, I learned something today!
Of course, by randomly generating running, we are essentially rendering branch prediction useless in the first function, so the results above are not too surprising. However, if we decide not to alter running during loop iterations and instead leave it at its default value, in this case false, branch prediction will do its magic in the first function, which will actually be faster by almost 20%, as these results suggest:
step_versionOne = 6273942
step_versionTwo = 7809508
step_two - step_one = 1535566
two fast than one by = 19.6628
Because running is constant throughout execution, notice that the simulation time is much shorter than it was with a randomly changing running; this is likely the result of a compiler optimization.
Why is the second function slower in this case? Branch prediction will quickly learn that the condition in the first function is never met and stop mispredicting (almost as though if(running) ticks += mStepSize; isn't even there). The second function, on the other hand, still has to execute ticks += mStepSize * static_cast<int>(running); in every iteration, making the first function more efficient.
But what if we set running to true? Branch prediction kicks in again, but this time the first function has to evaluate ticks += mStepSize; in every iteration; here are the results with running{true}:
step_versionOne = 7522095
step_versionTwo = 7891948
step_two - step_one = 369853
two fast than one by = 4.68646
Notice that step_versionTwo takes a consistent amount of time whether running is constantly true or false, yet it still takes marginally longer than step_versionOne. Well, I was too lazy to run the test enough times to determine whether the first version is consistently faster here or whether it was a one-time fluke (results vary slightly every run, since the OS has to run in the background and won't always do the same thing). But if it is consistently faster, it might be because version two (ticks += mStepSize * static_cast<int>(running);) performs one more arithmetic operation than version one (ticks += mStepSize;).
Finally, let's compile with optimization, g++ test.cpp -o test -std=c++11 -O1, revert running back to false, and check the results:
step_versionOne = 704973
step_versionTwo = 695052
More or less the same. The compiler will do its optimization pass, realize that running is always false, and thus, for all intents and purposes, remove the body of step_versionOne, so when you call it from the loop in main, it'll just call the function and return.
On the other hand, when optimizing the second function, it will realize that ticks += mStepSize * static_cast<int>(running); always yields the same result, i.e. 0, so it won't bother executing that either.
All in all, if I'm correct (and if not, please correct me; I'm pretty new to this), all you'll get when calling both functions from the main loop is their overhead.
P.S. Here's the result for the first case (running is randomly generated in every iteration) when compiled with optimization:
step_versionOne = 18868782
step_versionTwo = 18812315
step_two - step_one = 56467
one fast than two by = 0.299261