We were developing a piece of code that checks whether a user should be denied entry to a sector during a given time period. One of my colleagues wrote a function, isAllowed in the code below, that uses several comparisons. I took a different approach with isAllowed2, which compares the number of seconds between the time periods.
At first we had no doubt that his function would be faster, but that turned out not to be true when we actually ran the code and compared the speed. Even though the difference is small enough to ignore completely, we would like to know why the one that "should" be faster is in fact slower.
Considering the following code:
#include <iostream>
#include <chrono>
using namespace std;
using namespace std::chrono;
struct timing {
short hour;
short minute;
};
bool isAllowed(timing &from, timing &to, timing &actual) {
return !(((from.hour > to.hour && (actual.hour >= from.hour || actual.hour <= to.hour)) ||
(actual.hour >= from.hour && actual.hour <= to.hour)) &&
!(actual.minute > from.minute && actual.minute < to.minute));
}
long getSecs(short hour, short minutes) {
return (hour * 3600) + (minutes * 60);
}
bool isAllowed2(timing &from, timing &to, timing &current) {
long secsFrom = getSecs(from.hour, from.minute);
long secsTo = getSecs(to.hour, to.minute);
long secsCurrent = getSecs(current.hour, current.minute);
if (secsFrom > secsTo) secsTo += 24 * 60 * 60;
if (secsCurrent > secsFrom && secsCurrent < secsTo) {
return false;
}
return true;
}
int main() {
//debug messages
std::string okay = " - ok";
std::string error = " - er";
std::string receive = " - allowed";
std::string notReceive = " - denied";
//testing times
int const testDataCount = 5;
timing from[testDataCount] = {
{ 16, 30 },
{ 8, 30 },
{ 10, 30 },
{ 0, 30 },
{ 0, 0 }
};
timing to[testDataCount] = {
{ 8, 30 },
{ 20, 0 },
{ 20, 0 },
{ 6, 0 },
{ 7, 0 }
};
for (int i = 0; i < testDataCount; i++) {
std::cout << i + 1 << ": " << from[i].hour << ":" << from[i].minute << " to " << to[i].hour << ":"
<< to[i].minute << std::endl;
}
//test current times
timing current[5] = {
{ 12, 0 },
{ 23, 0 },
{ 17, 30 },
{ 15, 12 },
{ 0, 20 }
};
bool ergValues[][testDataCount] = {
{ true, false, false, true, true },
{ false, true, true, true, true },
{ false, false, false, true, true },
{ true, false, false, true, true },
{ false, true, true, true, false }
};
long totalNs1 = 0;
long totalNs2 = 0;
for (int i = 0; i < 4; i++) {
std::cout << std::endl << i + 1 << ". Test: " << current[i].hour << ":" << current[i].minute << std::endl;
for (int j = 0; j < testDataCount; j++) {
high_resolution_clock::time_point t1 = high_resolution_clock::now();
bool response = isAllowed(from[j], to[j], current[i]);
high_resolution_clock::time_point t2 = high_resolution_clock::now();
high_resolution_clock::time_point t3 = high_resolution_clock::now();
bool response2 = isAllowed2(from[j], to[j], current[i]);
high_resolution_clock::time_point t4 = high_resolution_clock::now();
long ns1 = duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
totalNs1 += ns1;
long ns2 = duration_cast<std::chrono::nanoseconds>(t4 - t3).count();
totalNs2 += ns2;
std::cout << j + 1 << "\t\t:1:" << ns1 << "ns: " << response << (response == ergValues[i][j] ? okay : error) << "\t\t:2:" << ns2 << "ns: " << response2 << (response2 == ergValues[i][j] ? okay : error) << "\t\t"
<< (ergValues[i][j] ? receive : notReceive) << std::endl;
}
}
std::cout << "\r\ntotalNs1 = " << totalNs1 << "\r\ntotalNs2 = " << totalNs2 << "\r\n\r\n";
return 0;
}
The results obviously differ from run to run, but totalNs2 is always smaller than totalNs1.
Ex:
totalNs1 = 38796
totalNs2 = 25913
I tested this on an AMD Phenom II X4 and an Intel i7-3770, both on Windows 10 and Debian 8, with quite similar results.
So finally the question is: why is isAllowed2 faster than isAllowed?
Note: This is mostly a curiosity question. If someone thinks the title or the tags are not the most appropriate, please let me know and I'll change them accordingly. Please excuse any grammar errors, as English is not my native language.
Meanwhile I've been researching further based on all the suggestions and comments, including this incredibly detailed answer. After understanding how inaccurate micro-benchmarking can be (a HUGE thanks to Baum mit Augen for linking this amazing talk, which helped a lot), I ended up using the Google microbenchmark library to get more "accurate" results. It seems that the isAllowed function is in fact faster (compiled without optimization), as shown in the output from the library.
Run on (8 X 2395 MHz CPU s)
-----------------------------------------------------------------------
Benchmark                            Time           CPU     Iterations
-----------------------------------------------------------------------
BM_isAllowed/2/min_time:2.000       22 ns         22 ns      128000000
BM_isAllowed/4/min_time:2.000       22 ns         22 ns      137846154
BM_isAllowed/8/min_time:2.000       22 ns         22 ns      128000000
BM_isAllowed/16/min_time:2.000      22 ns         22 ns      128000000
BM_isAllowed/22/min_time:2.000      22 ns         22 ns      137846154
BM_isAllowed2/2/min_time:2.000      24 ns         24 ns      112000000
BM_isAllowed2/4/min_time:2.000      24 ns         24 ns      119466667
BM_isAllowed2/8/min_time:2.000      24 ns         24 ns      119466667
BM_isAllowed2/16/min_time:2.000     24 ns         24 ns      119466667
BM_isAllowed2/22/min_time:2.000     24 ns         24 ns      119466667
Note: As Martin Bonner pointed out, the isAllowed function seems to have a logic flaw, so don't use it in production code.
Below you can find the code I used for this benchmark. Please let me know if there are any flaws in it, as I'm not familiar with the Google library.
Important: this code was compiled with Visual Studio 2015, and optimization should be disabled for the section that we want to test.
#include "benchmark/benchmark.h"
using namespace std;
using namespace benchmark;
#pragma optimize( "[optimization-list]", {on | off} )
volatile const long extraDay = 24 * 60 * 60;
#pragma optimize( "", off )
struct timing {
short hour;
short minute;
timing(short hour, short minute) : hour(hour), minute(minute) {}
};
static void BM_isAllowed(benchmark::State& state) {
while (state.KeepRunning())
{
timing from(state.range(0), state.range(0));
timing to(state.range(0), state.range(0));
timing current(state.range(0), state.range(0));
bool b = !(((from.hour > to.hour && (current.hour >= from.hour || current.hour <= to.hour)) ||
(current.hour >= from.hour && current.hour <= to.hour)) &&
!(current.minute > from.minute && current.minute < to.minute));
}
}
static void BM_isAllowed2(benchmark::State& state) {
while (state.KeepRunning())
{
timing from(state.range(0), state.range(0));
timing to(state.range(0), state.range(0));
timing current(state.range(0), state.range(0));
bool b;
long secsFrom = (from.hour * 3600) + (from.minute * 60);
long secsTo = (to.hour * 3600) + (to.minute * 60);
long secsCurrent = (current.hour * 3600) + (current.minute * 60);
if (secsFrom > secsTo)
secsTo += extraDay;
if (secsCurrent > secsFrom && secsCurrent < secsTo)
b = false;
else
b = true;
}
}
#pragma optimize( "", on )
BENCHMARK(BM_isAllowed)->RangeMultiplier(2)->Range(2, 22)->MinTime(2);
BENCHMARK(BM_isAllowed2)->RangeMultiplier(2)->Range(2, 22)->MinTime(2);
BENCHMARK_MAIN();
There is no golden rule for this. Unfortunately, the performance of code like this is notoriously hard to predict. The most important thing to take away from that is to measure.
Now to what's going on in your code: As others noted correctly, we can observe that isAllowed gets compiled to a function using branches, while isAllowed2 ends up being branchless.
Branches are interesting when talking about performance: They are somewhere between literally free and ridiculously expensive, inclusively. This is due to a CPU component called the branch predictor. It tries to predict which branch your control flow will take and makes the CPU speculatively execute it. If it guesses right, the branch is free. If it guesses wrong, the branch is expensive. A great and detailed explanation of that concept, including some numbers, can be found in this answer.
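To make the distinction concrete, here is a minimal sketch of the same range check written two ways. The names inRangeBranchy and inRangeBranchless are made up for illustration and do not appear in the question. Which form a compiler actually emits with a jump is an assumption on my part: an optimizer is free to turn either one into the other, so check the generated assembly before drawing conclusions.

```cpp
// Hypothetical helpers, not from the question: the same open-interval test
// written so that it tends to compile with and without conditional jumps.

// Early returns usually (not always) become conditional branches.
bool inRangeBranchy(long x, long lo, long hi) {
    if (x <= lo) return false;
    if (x >= hi) return false;
    return true;
}

// Bitwise & evaluates both comparisons unconditionally, so there is no
// short-circuit that the compiler would need a jump to implement.
bool inRangeBranchless(long x, long lo, long hi) {
    return static_cast<bool>((x > lo) & (x < hi));
}
```

Both functions return the same answer for every input; only the shape of the machine code is likely to differ.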
So now we need to decide whether we want the branching or the branchless version. In general, neither need be faster than the other! It really depends on how well your target CPUs can predict the branches, which of course depends on the actual input. (Choosing whether to compile a function to a branching or a branchless result is thus a hard problem for compilers as they don't know what CPUs the binary will be run on, nor what kind of input data to expect. See for example this blogpost.)
So if your benchmark was actually correct†, we have determined that on your CPU the branches are too hard to predict to beat the relatively cheap integer arithmetic. This may also be due to the tiny number of test cases: the branch predictor cannot learn a pattern from so few invocations. But again, we cannot just call that one way or the other; we have to look at the actual performance in the specific case.
†As noted in the comments, the execution time is somewhat short for a good measurement; I see huge deviations on my machine. For information about micro-benchmarking you can have a look at this talk. It's harder than one might think.
Also, as Martin Bonner helpfully noticed, your two functions don't do the same thing. You'd have to fix that for a correct benchmark, of course.
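A quick way to see the mismatch is to run both functions over every minute of the day for one window and count the disagreements. The sketch below copies the two functions from the question (the const references and the condensed body of isAllowed2 are my changes, the behavior is meant to be the same) and adds a hypothetical helper, countMismatches, that is not part of the original code.

```cpp
struct timing {
    short hour;
    short minute;
};

// Copied from the question, except for the const references.
bool isAllowed(const timing &from, const timing &to, const timing &actual) {
    return !(((from.hour > to.hour && (actual.hour >= from.hour || actual.hour <= to.hour)) ||
              (actual.hour >= from.hour && actual.hour <= to.hour)) &&
             !(actual.minute > from.minute && actual.minute < to.minute));
}

// Same logic as the question's isAllowed2, with getSecs written inline.
bool isAllowed2(const timing &from, const timing &to, const timing &current) {
    long secsFrom = from.hour * 3600L + from.minute * 60L;
    long secsTo = to.hour * 3600L + to.minute * 60L;
    long secsCurrent = current.hour * 3600L + current.minute * 60L;
    if (secsFrom > secsTo) secsTo += 24 * 60 * 60;
    return !(secsCurrent > secsFrom && secsCurrent < secsTo);
}

// Hypothetical helper: number of minutes of the day on which the two
// functions disagree for the window from..to.
int countMismatches(const timing &from, const timing &to) {
    int mismatches = 0;
    for (short h = 0; h < 24; ++h) {
        for (short m = 0; m < 60; ++m) {
            timing now = { h, m };
            if (isAllowed(from, to, now) != isAllowed2(from, to, now)) {
                ++mismatches;
            }
        }
    }
    return mismatches;
}
```

For example, with the window 8:30 to 20:00 and a current time of 8:00, isAllowed returns false while isAllowed2 returns true, so countMismatches({8, 30}, {20, 0}) is non-zero.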
Because you are not measuring what you want to measure.
In fact, executing your two functions takes around 40 ns on my computer, but if I use your test code I get a result on the order of 500 ns.
You are not performing the measurement you want because:
1. The time to execute these functions only once is of the same order as (or even smaller than) the execution time of the function that actually reads the clock. As a rule of thumb, when testing, try to measure intervals longer than 10 ms.
2. Between the two ticks, the compiler has placed aggressively inlined and optimized versions of your functions, because it knows what the inputs are. That is probably not what will happen in the real case.
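The first point can be checked directly. The following is a rough sketch, not a calibrated tool: clockOverheadNs is a hypothetical helper that averages the gap between two back-to-back clock reads with nothing in between. I use std::chrono::steady_clock here rather than high_resolution_clock because it is guaranteed to be monotonic; that substitution is my assumption, not something taken from the question.

```cpp
#include <chrono>

// Hypothetical helper: average time, in nanoseconds, between two
// back-to-back clock reads. Any measured interval of similar size is
// dominated by the cost of reading the clock itself.
double clockOverheadNs(int samples) {
    using clock = std::chrono::steady_clock;
    long long total = 0;
    for (int i = 0; i < samples; ++i) {
        auto t1 = clock::now();
        auto t2 = clock::now();
        total += std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
    }
    return samples > 0 ? static_cast<double>(total) / samples : 0.0;
}
```

If the value returned by, say, clockOverheadNs(100000) is in the same range as the per-call numbers printed by the original test, those numbers mostly reflect the clock and not the functions.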
If you put the definitions of your two functions in a different file from the one where your inputs are defined:
//is_allowed.cpp
struct timing {
short hour;
short minute;
};
bool isAllowed(timing &from, timing &to, timing &actual) {
return !(((from.hour > to.hour && (actual.hour >= from.hour || actual.hour <= to.hour)) ||
(actual.hour >= from.hour && actual.hour <= to.hour)) &&
!(actual.minute > from.minute && actual.minute < to.minute));
}
static long getSecs(short hour, short minutes) {
return (hour * 3600) + (minutes * 60);
}
bool isAllowed2(timing &from, timing &to, timing &current) {
long secsFrom = getSecs(from.hour, from.minute);
long secsTo = getSecs(to.hour, to.minute);
long secsCurrent = getSecs(current.hour, current.minute);
if (secsFrom > secsTo) secsTo += 24 * 60 * 60;
if (secsCurrent > secsFrom && secsCurrent < secsTo) {
return false;
}
return true;
}
And then execute your functions a million times between the "ticks", you will get a much more reliable result:
int main(){
//...
high_resolution_clock::time_point t1 = high_resolution_clock::now();
for (int x=1;x<1000000;++x)
isAllowed(from[j], to[j], current[i]);
high_resolution_clock::time_point t2 = high_resolution_clock::now();
high_resolution_clock::time_point t3 = high_resolution_clock::now();
for (int x=1;x<1000000;++x)
isAllowed2(from[j], to[j], current[i]);
high_resolution_clock::time_point t4 = high_resolution_clock::now();
long ns1 = duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
totalNs1 += ns1;
long ns2 = duration_cast<std::chrono::nanoseconds>(t4 - t3).count();
totalNs2 += ns2;
// std::cout << j + 1 << "\t\t:1:" << ns1 << "ns: " << response << (response == ergValues[i][j] ? okay : error) << "\t\t:2:" << ns2 << "ms: " << response2 << (response2 == ergValues[i][j] ? okay : error) << "\t\t"
// << (ergValues[i][j] ? receive : notReceive) << std::endl;
//...
}
Surprise, I get:
totalNs1=38085793 //(38ms)
totalNs2=52182920 //(52ms)
While with your code, on the exact same computer and compiler, I got:
totalNs1 = 927
totalNs2 = 587
As you expected, the first version, isAllowed, is actually the winner!