How can several comparisons be slower than some calculations?

We were developing a piece of code that checks whether a user should be denied entry to a sector during a given time period. One of my colleagues wrote a function, isAllowed in the code below, that uses several comparisons. I took a different approach, isAllowed2, which converts each time to a number of seconds and compares those values.

At first we had no doubt that his function would be faster, but that turned out not to be the case when we actually ran the code and compared the speed. Even though the difference is small enough to ignore completely, we would like to know why the one that "should" be faster is in fact slower.

Consider the following code:

#include <iostream>
#include <chrono>
#include <string>

using namespace std;
using namespace std::chrono;

struct timing {
    short hour;
    short minute;
};

bool isAllowed(timing &from, timing &to, timing &actual) {
    return !(((from.hour > to.hour && (actual.hour >= from.hour || actual.hour <= to.hour)) ||
        (actual.hour >= from.hour && actual.hour <= to.hour)) &&
        !(actual.minute > from.minute && actual.minute < to.minute));
}

long getSecs(short hour, short minutes) {

    return (hour * 3600) + (minutes * 60);

}

bool isAllowed2(timing &from, timing &to, timing &current) {

    long secsFrom = getSecs(from.hour, from.minute);
    long secsTo = getSecs(to.hour, to.minute);
    long secsCurrent = getSecs(current.hour, current.minute);

    if (secsFrom > secsTo) secsTo += 24 * 60 * 60;
    if (secsCurrent > secsFrom && secsCurrent < secsTo) {
        return false;
    }

    return true;
}

int main() {
    //debug messages
    std::string okay = " - ok";
    std::string error = " - er";

    std::string receive = " - allowed";
    std::string notReceive = " - denied";

    //testing times
    int const testDataCount = 5;
    timing from[testDataCount] = {
        { 16, 30 },
        { 8,  30 },
        { 10, 30 },
        { 0, 30 },
        { 0, 0 }
    };
    timing to[testDataCount] = {
        { 8,  30 },
        { 20, 0 },
        { 20, 0 },
        { 6, 0 },
        { 7, 0 }
    };

    for (int i = 0; i < testDataCount; i++) {
        std::cout << i + 1 << ": " << from[i].hour << ":" << from[i].minute << " to " << to[i].hour << ":"
            << to[i].minute << std::endl;
    }

    //test current times
    timing current[5] = {
        { 12, 0 },
        { 23, 0 },
        { 17, 30 },
        { 15, 12 },
        { 0, 20 }
    };

    bool ergValues[][testDataCount] = {
        { true,  false, false, true, true },
        { false, true,  true, true, true },
        { false, false, false, true, true },
        { true,  false, false, true, true },
        { false,  true, true, true, false }
    };

    long totalNs1 = 0;
    long totalNs2 = 0;

    for (int i = 0; i < 4; i++) {
        std::cout << std::endl << i + 1 << ". Test: " << current[i].hour << ":" << current[i].minute << std::endl;
        for (int j = 0; j < testDataCount; j++) {

            high_resolution_clock::time_point t1 = high_resolution_clock::now();
            bool response = isAllowed(from[j], to[j], current[i]);
            high_resolution_clock::time_point t2 = high_resolution_clock::now();

            high_resolution_clock::time_point t3 = high_resolution_clock::now();
            bool response2 = isAllowed2(from[j], to[j], current[i]);
            high_resolution_clock::time_point t4 = high_resolution_clock::now();

            long ns1 = duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
            totalNs1 += ns1;
            long ns2 = duration_cast<std::chrono::nanoseconds>(t4 - t3).count();
            totalNs2 += ns2;

            std::cout << j + 1 << "\t\t:1:" << ns1 << "ns: " << response << (response == ergValues[i][j] ? okay : error) << "\t\t:2:" << ns2 << "ns: " << response2 << (response2 == ergValues[i][j] ? okay : error) << "\t\t"
                << (ergValues[i][j] ? receive : notReceive) << std::endl;
        }
    }

    std::cout << "\r\ntotalNs1 = " << totalNs1 << "\r\ntotalNs2 = " << totalNs2 << "\r\n\r\n";

    return 0;
}

The results obviously differ from run to run, but totalNs2 is always smaller than totalNs1.

For example:

totalNs1 = 38796
totalNs2 = 25913

I tested this on an AMD Phenom II X4 and an Intel i7-3770, both on Windows 10 and Debian 8, with quite similar results.

So finally the question is: why is isAllowed2 faster than isAllowed?

Note: This is mostly a curiosity question. If someone thinks the title or the tags are not the most appropriate, please let me know and I'll change them accordingly. Please excuse any grammar errors, as English is not my native language.

EDIT

Meanwhile, I've been researching further based on all the suggestions and comments, including this incredibly detailed answer. After understanding how inaccurate micro-benchmarking can be (a HUGE thanks to Baum mit Augen for linking this amazing talk, which helped a lot), I ended up using the Google Benchmark microbenchmark library to get more "accurate" results. It seems that isAllowed is in fact faster (compiled without optimization), as shown in the library's output:

Run on (8 X 2395 MHz CPU s)
-----------------------------------------------------------------------
Benchmark                                Time           CPU Iterations
-----------------------------------------------------------------------
BM_isAllowed/2/min_time:2.000           22 ns         22 ns  128000000
BM_isAllowed/4/min_time:2.000           22 ns         22 ns  137846154
BM_isAllowed/8/min_time:2.000           22 ns         22 ns  128000000
BM_isAllowed/16/min_time:2.000          22 ns         22 ns  128000000
BM_isAllowed/22/min_time:2.000          22 ns         22 ns  137846154
BM_isAllowed2/2/min_time:2.000          24 ns         24 ns  112000000
BM_isAllowed2/4/min_time:2.000          24 ns         24 ns  119466667
BM_isAllowed2/8/min_time:2.000          24 ns         24 ns  119466667
BM_isAllowed2/16/min_time:2.000         24 ns         24 ns  119466667
BM_isAllowed2/22/min_time:2.000         24 ns         24 ns  119466667

Note: As Martin Bonner pointed out, the isAllowed function seems to have a logic flaw, so don't use it in production code.

Below you can find the code I used for this benchmark. Please let me know if there are any flaws in it, as I'm not familiar with the Google library.

Important: this code was compiled with Visual Studio 2015, and optimization should be disabled for the section under test.

#include "benchmark/benchmark.h"

using namespace std;
using namespace benchmark;

// MSVC syntax reference: #pragma optimize( "[optimization-list]", {on | off} )

volatile const long extraDay = 24 * 60 * 60;

#pragma optimize( "", off )

struct timing {
    short hour;
    short minute;
    timing(short hour, short minute) : hour(hour), minute(minute) {}
};

static void BM_isAllowed(benchmark::State& state) {

    while (state.KeepRunning())
    {
        timing from(state.range(0), state.range(0));
        timing to(state.range(0), state.range(0));
        timing current(state.range(0), state.range(0));

        bool b = !(((from.hour > to.hour && (current.hour >= from.hour || current.hour <= to.hour)) ||
            (current.hour >= from.hour && current.hour <= to.hour)) &&
            !(current.minute > from.minute && current.minute < to.minute));
    }
}

static void BM_isAllowed2(benchmark::State& state) {

    while (state.KeepRunning())
    {
        timing from(state.range(0), state.range(0));
        timing to(state.range(0), state.range(0));
        timing current(state.range(0), state.range(0));

        bool b;
        long secsFrom = (from.hour * 3600) + (from.minute * 60);
        long secsTo = (to.hour * 3600) + (to.minute * 60);
        long secsCurrent = (current.hour * 3600) + (current.minute * 60);

        if (secsFrom > secsTo)
            secsTo += extraDay;
        if (secsCurrent > secsFrom && secsCurrent < secsTo)
            b = false;
        else
            b = true;
    }
}
#pragma optimize( "", on ) 

BENCHMARK(BM_isAllowed)->RangeMultiplier(2)->Range(2, 22)->MinTime(2);
BENCHMARK(BM_isAllowed2)->RangeMultiplier(2)->Range(2, 22)->MinTime(2);
BENCHMARK_MAIN();

asked Mar 06 '17 16:03

Mike

2 Answers

There is no golden rule for this. Unfortunately, the performance of code like this is notoriously hard to predict. The most important thing to take away from that is

Measure everything!

Now to what's going on in your code: As others noted correctly, we can observe that isAllowed gets compiled to a function using branches, while isAllowed2 ends up being branchless.

Branches are interesting when talking about performance: their cost lies somewhere between literally free and ridiculously expensive, inclusive. This is due to a CPU component called the branch predictor. It tries to predict which branch your control flow will take and makes the CPU speculatively execute it. If it guesses right, the branch is free. If it guesses wrong, the branch is expensive. A great and detailed explanation of that concept, including some numbers, can be found in this answer.

So now we need to decide whether we want the branching or the branchless version. In general, neither need be faster than the other! It really depends on how well your target CPUs can predict the branches, which of course depends on the actual input. (Choosing whether to compile a function to a branching or a branchless result is thus a hard problem for compilers as they don't know what CPUs the binary will be run on, nor what kind of input data to expect. See for example this blogpost.)

So if your benchmark was actually correct (†), we have determined that on your CPU the branches are too hard to predict to beat the relatively cheap integer arithmetic. This may also be due to the tiny number of test cases: the branch predictor cannot learn a pattern from so few invocations. But again, we cannot just call it one way or the other; we have to look at the actual performance in the specific case.

(†) As noted in the comments, the execution time is somewhat short for a good measurement, and I see huge deviations on my machine. For information about micro-benchmarking you can have a look at this talk. It's harder than one might think.

Also, as Martin Bonner helpfully noticed, your two functions don't do the same thing. You'd have to fix that for a correct benchmark, of course.

answered Oct 01 '22 16:10

Baum mit Augen

Because you are not measuring what you want to measure.

In fact, executing your two functions takes around 40 ns on my computer, but with your test code I get a result on the order of 500 ns.

You are not performing the measurement you want because:

1. The time to execute these functions only once is of the same order as (or even smaller than) the execution time of the function that actually reads the clock. As a rule of thumb, when testing, try to measure times that are larger than 10 ms.
2. Between the two ticks the compiler has placed aggressively inlined and optimized versions of your functions, because it knows what the inputs are. That will probably not happen in the real case.

If you put the definitions of your two functions in a different file from the one where your inputs are defined:

//is_allowed.cpp
struct timing {
    short hour;
    short minute;
};
bool isAllowed(timing &from, timing &to, timing &actual) {
    return !(((from.hour > to.hour && (actual.hour >= from.hour || actual.hour <= to.hour)) ||
        (actual.hour >= from.hour && actual.hour <= to.hour)) &&
        !(actual.minute > from.minute && actual.minute < to.minute));
}

static long getSecs(short hour, short minutes) {

    return (hour * 3600) + (minutes * 60);

}

bool isAllowed2(timing &from, timing &to, timing &current) {

    long secsFrom = getSecs(from.hour, from.minute);
    long secsTo = getSecs(to.hour, to.minute);
    long secsCurrent = getSecs(current.hour, current.minute);

    if (secsFrom > secsTo) secsTo += 24 * 60 * 60;
    if (secsCurrent > secsFrom && secsCurrent < secsTo) {
        return false;
    }

    return true;
}

And then execute your functions a million times between the "ticks", you will get a much more reliable result:

int main(){
//...

            high_resolution_clock::time_point t1 = high_resolution_clock::now();
            for (int x=1;x<1000000;++x)
               isAllowed(from[j], to[j], current[i]);
            high_resolution_clock::time_point t2 = high_resolution_clock::now();

            high_resolution_clock::time_point t3 = high_resolution_clock::now();
            for (int x=1;x<1000000;++x)
               isAllowed2(from[j], to[j], current[i]);
            high_resolution_clock::time_point t4 = high_resolution_clock::now();

            long ns1 = duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
            totalNs1 += ns1;
            long ns2 = duration_cast<std::chrono::nanoseconds>(t4 - t3).count();
            totalNs2 += ns2;

//            std::cout << j + 1 << "\t\t:1:" << ns1 << "ns: " << response << (response == ergValues[i][j] ? okay : error) << "\t\t:2:" << ns2 << "ms: " << response2 << (response2 == ergValues[i][j] ? okay : error) << "\t\t"
//                << (ergValues[i][j] ? receive : notReceive) << std::endl;
//...
    }

Surprise, I get:

    totalNs1=38085793  //(38ms)
    totalNs2=52182920  //(52ms)

Whereas with your code, on the exact same computer and compiler, I got:

    totalNs1 = 927
    totalNs2 = 587

As you expected, the first version, isAllowed, is actually the winner!

answered Oct 01 '22 15:10

Oliv

Tags:

c++

performance

optimization

c++11