How to measure the call overhead of functions?

I wanted to measure and compare the overhead of different kinds of function calls, specifically of two alternative ways to extend a class while minimizing code modification:

  • using an abstract base class and providing implementations in virtual member functions
  • using a policy host class and defining different policies with static and member functions

Both options are compared with calling no function at all. I am also aware of the NVI idiom usually used when designing classes that support dynamic polymorphism; the example I used is just a benchmark of the call overhead.

Here is the code that I tried to use for this purpose:

#include <iostream>
#include <vector>
#include <chrono>
#include <ctime>
#include <memory>

// dynamic polymorphism: an abstract base class with a virtual member function
class Interface 
{
    public:
        virtual double calculate(double t) = 0; 
        virtual ~Interface() = default;

};

class Square
: 
    public Interface
{
    public:

       double calculate(double d) override
       {
           return d*d;
       }

};

// static polymorphism: policy classes that provide calculate()
class SquareStaticFunction
{
    public:
        static double calculate(double d)
        {
            return d*d; 
        }
};

class SquareMemberFunction
{
    public:
        double calculate(double d)
        {
            return d*d; 
        }
};

// policy host: inherits calculate() from the chosen policy
template<typename Function>
class Generic
:
    public Function
{
    public:
        using Function::calculate;
};

using namespace std;

int main(int argc, const char *argv[])
{
    vector<double> test(1e06, 5); 

    unique_ptr<Interface> sUptr(new Square());

    Interface* sPtr = new Square(); 

    Generic<SquareStaticFunction> gStatic; 
    Generic<SquareMemberFunction> gMember; 

    double result;

    typedef std::chrono::high_resolution_clock Clock; 

    // baseline: the multiplication is written out inline, no function call
    auto start = Clock::now();
    for (auto d : test)
    {
        result = d * d;  
    }
    auto end = Clock::now(); 

    auto noFunction = end - start; 

    // virtual member function called through a unique_ptr
    start = Clock::now();

    for (auto d : test)
    {
        result = sUptr->calculate(d);
    }
    end = Clock::now();  

    auto virtualMemberFunction = end - start; 

    // virtual member function called through a raw base-class pointer
    start = Clock::now();

    for (auto d : test)
    {
        result = sPtr->calculate(d);
    }
    end = Clock::now();  

    auto virtualMemberFunctionRaw = end - start; 

    // policy with a static member function
    start = Clock::now();
    for (auto d : test)
    {
        result = gStatic.calculate(d);  
    }
    end = Clock::now(); 

    auto staticPolicy = end - start; 

    // policy with a non-static member function
    start = Clock::now();
    for (auto d : test)
    {
        result = gMember.calculate(d);  
    }
    end = Clock::now(); 

    auto memberPolicy = end - start; 

    cout << noFunction.count() << " " << virtualMemberFunction.count() 
        << " " << virtualMemberFunctionRaw.count() 
        << " " << staticPolicy.count() 
        << " " << memberPolicy.count() << endl;

    delete sPtr; 
    sPtr = nullptr;

    return 0;
}

I compiled the code using gcc 4.8.2 on a Linux x86_64 machine with the following CPU: Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz.

The virtual member function is accessed in one test via a raw pointer and in another via a unique_ptr. First, I compiled the code without any optimizations:

g++ -std=c++11 main.cpp -o main

and ran 1000 tests with the following shell command:

for i in {1..1000}; do ./main >> results; done

I plotted the results file using the following gnuplot script (note the logarithmic y-axis):

set terminal png size 1600,800
set logscale y 
set key out vert right top
set out 'results.png' 
plot 'results' using 0:1 title "no function" , \
     'results' using 0:2 title "virtual member function (unique ptr)", \
     'results' using 0:3 title "virtual member function (raw ptr)", \
     'results' using 0:4 title "static policy", \
     'results' using 0:5 title 'member function policy'
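
To generate the plot, save the script (say, as results.gp; the name is just an example) and run:

gnuplot results.gp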

For the non-optimized code, the diagram looks like this:

[Plot: non-optimized function call overhead]

Q1: Does the call to the virtual function via the unique_ptr end up being the most expensive one because it involves an extra indirection when dereferencing the pointer to the managed object?

Then I turned on optimization and compiled the code with:

g++ -std=c++11 -O3 main.cpp -o main

which resulted in the following diagram:

[Plot: optimized (-O3) function call overhead]

Q2: Are the virtual member functions the most costly in this case because, when they are accessed via a base class pointer or reference (so vtable dispatch is in play), the compiler cannot inline them?

Q3: This question is what made me post all this: how is it possible in the optimized diagram that the static and member policies end up being faster than the rolled-out code for this simple example?

Edit: making result volatile and compiling with optimizations turned on makes the run times of the policies much larger, and now similar to those of the raw multiplication code:

[Plot: optimized function call overhead with volatile result]
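
A minimal sketch of that change, assuming only the declaration of result is modified:

// volatile forces every store to result to actually be performed,
// so the compiler cannot elide the assignments inside the timed loops
volatile double result;

Everything else in main stays the same.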

Edit: modifying the code so that the result is accumulated instead of assigned (as proposed by dyk in the comments), without using volatile:

result += ...

produces the same diagram as the original code.

asked Mar 10 '14 by tmaric



1 Answer

Inspecting the disassembly of your code compiled with -O3 -march=native -std=c++11 showed that the compiler was doing "too much" optimization: it detected the repeated assignment to the same unused variable. As suggested in the comments, I used += instead of =. I also initialized result = 0 and made main return result instead of 0, to make sure the compiler actually computes the value.
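
As a sketch, the relevant changes look like this (only the accumulation and the return differ; the timing code is unchanged):

double result = 0;        // initialized, so += is well-defined
// ...
for (auto d : test)
{
    result += d * d;      // accumulate instead of overwriting
}
// ... the same change in the other four loops ...
return result;            // returning result keeps the computation observable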

This modified code gives:

  • noFunction, staticPolicy and memberPolicy are lowered to mulsd, addsd, addsd, i.e. scalar SSE instructions. Clang also doesn't vectorize (with vanilla options), but Intel's icc does (it generates vector and non-vector versions and jumps between them depending on alignment and iteration count).
  • virtualMemberFunction and virtualMemberFunctionRaw result in a dynamic function call (no de-virtualization, and hence no inlining).

You can see for yourself by pasting your code here.
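
If you prefer to inspect the assembly locally, standard gcc tooling is enough (a sketch; the file names are just the ones from this post):

g++ -std=c++11 -O3 -march=native -S -fverbose-asm main.cpp -o main.s

or, to disassemble the optimized binary with demangled C++ names:

g++ -std=c++11 -O3 -march=native main.cpp -o main
objdump -d -C main | less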

To answer your Q1 ("pointer vs unique_ptr in a debug build"): at -O0, calls are not inlined; in particular, unique_ptr::operator-> is emitted as an explicit function call, so that's 2 function calls per iteration instead of 1 for the raw pointer. This difference disappears in optimized builds.
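
In other words, at -O0 the unique_ptr call is compiled roughly as if it were written like this (a sketch of what the generated code does, not a suggested change):

// call 1: unique_ptr<Interface>::operator->() returns the raw pointer
Interface* raw = sUptr.operator->();
// call 2: the virtual dispatch through the vtable
result = raw->calculate(d);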

To answer your Q2 ("is it possible to inline virtual calls"): in this example, gcc and clang don't inline the call, probably because they don't do enough static analysis. But you can help them. For instance, with clang 3.3 (but not 3.2, and not gcc) declaring the method as const and __attribute__((pure)) does the job. In gcc (4.8, pre-4.9) I tried marking the method as final and compiling with -fwhole-program, but this didn't remove the call. So yes, in this specific case it is possible to de-virtualize, but not reliably. In general, JIT compilers (C#, Java) are better at de-virtualizing because they can make better assumptions from runtime information.
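
For illustration, the variant that worked with clang 3.3 looked roughly like this (a sketch; __attribute__((pure)) is a GCC/Clang extension, and whether it removes the call is compiler- and version-specific):

class Interface
{
    public:
        // const + pure: the result depends only on the arguments and has no
        // side effects, which is what lets clang 3.3 optimize the call away
        virtual double calculate(double d) const __attribute__((pure)) = 0;
        virtual ~Interface() = default;
};

The override in Square must then be marked const as well.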

answered by Antoine