How to measure the call overhead of functions?

I wanted to measure and compare the overhead of different kinds of function calls, specifically of two alternative ways to extend a class while minimizing code modification:

  • using an abstract base class and providing implementations in virtual member functions
  • using a policy host class and defining different policies with static and member functions

Both options are compared with calling no function at all. I am also aware of the NVI idiom usually used when designing classes that support dynamic polymorphism; the example I used is just a benchmark of the call overhead.

Here is the code that I tried to use for this purpose:

#include <iostream>
#include <vector>
#include <chrono>
#include <ctime>
#include <memory>

// dynamic polymorphism: an abstract base class with a virtual member function
class Interface 
{
    public:
        virtual double calculate(double t) = 0; 
        virtual ~Interface() = default;

};

class Square
: 
    public Interface
{
    public:

       double calculate(double d) override
       {
           return d*d;
       }

};

// static polymorphism: policy classes that provide calculate()
class SquareStaticFunction
{
    public:
        static double calculate(double d)
        {
            return d*d; 
        }
};

class SquareMemberFunction
{
    public:
        double calculate(double d)
        {
            return d*d; 
        }
};

// policy host: inherits calculate() from the chosen policy
template<typename Function>
class Generic
:
    public Function
{
    public:
        using Function::calculate;
};

using namespace std;

int main(int argc, const char *argv[])
{
    vector<double> test(1e06, 5); 

    unique_ptr<Interface> sUptr(new Square());

    Interface* sPtr = new Square(); 

    Generic<SquareStaticFunction> gStatic; 
    Generic<SquareMemberFunction> gMember; 

    double result;

    typedef std::chrono::high_resolution_clock Clock; 

    // baseline: the multiplication is written out inline, no function call
    auto start = Clock::now();
    for (auto d : test)
    {
        result = d * d;  
    }
    auto end = Clock::now(); 

    auto noFunction = end - start; 

    // virtual member function called through a unique_ptr
    start = Clock::now();

    for (auto d : test)
    {
        result = sUptr->calculate(d);
    }
    end = Clock::now();  

    auto virtualMemberFunction = end - start; 

    // virtual member function called through a raw base-class pointer
    start = Clock::now();

    for (auto d : test)
    {
        result = sPtr->calculate(d);
    }
    end = Clock::now();  

    auto virtualMemberFunctionRaw = end - start; 

    // policy with a static member function
    start = Clock::now();
    for (auto d : test)
    {
        result = gStatic.calculate(d);  
    }
    end = Clock::now(); 

    auto staticPolicy = end - start; 

    // policy with a non-static member function
    start = Clock::now();
    for (auto d : test)
    {
        result = gMember.calculate(d);  
    }
    end = Clock::now(); 

    auto memberPolicy = end - start; 

    cout << noFunction.count() << " " << virtualMemberFunction.count() 
        << " " << virtualMemberFunctionRaw.count() 
        << " " << staticPolicy.count() 
        << " " << memberPolicy.count() << endl;

    delete sPtr; 
    sPtr = nullptr;

    return 0;
}

I compiled the code using gcc 4.8.2 on a Linux x86_64 machine with the following CPU: Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz.

The virtual member function is accessed in one test via a raw pointer and in another via a unique_ptr. First, I compiled the code without any optimizations:

g++ -std=c++11 main.cpp -o main

and ran 1000 tests with the following shell command:

for i in {1..1000}; do ./main >> results; done

I plotted the results file using the following gnuplot script (note the logarithmic y-axis):

set terminal png size 1600,800
set logscale y 
set key out vert right top
set out 'results.png' 
plot 'results' using 0:1 title "no function" , \
     'results' using 0:2 title "virtual member function (unique ptr)", \
     'results' using 0:3 title "virtual member function (raw ptr)", \
     'results' using 0:4 title "static policy", \
     'results' using 0:5 title 'member function policy'
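
To generate the plot, save the script (say, as results.gp; the name is just an example) and run:

gnuplot results.gp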

For the non-optimized code, the diagram looks like this:

[Plot: non-optimized function call overhead]

Q1: Does the call to the virtual function via the unique_ptr end up being the most expensive one because it involves an extra indirection when dereferencing the pointer to the managed object?

Then I turned on optimization and compiled the code with:

g++ -std=c++11 -O3 main.cpp -o main

which resulted in the following diagram:

[Plot: optimized (-O3) function call overhead]

Q2: Are the virtual member functions the most costly in this case because, when they are accessed via a base class pointer or reference (so vtable dispatch is in play), the compiler cannot inline them?

Q3: This question is what made me post all this: how is it possible in the optimized diagram that the static and member policies end up being faster than the rolled-out code for this simple example?

Edit: making result volatile and compiling with optimizations turned on makes the run times of the policies much larger, and now similar to those of the raw multiplication code:

[Plot: optimized function call overhead with volatile result]
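
A minimal sketch of that change, assuming only the declaration of result is modified:

// volatile forces every store to result to actually be performed,
// so the compiler cannot elide the assignments inside the timed loops
volatile double result;

Everything else in main stays the same.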

Edit: modifying the code so that the result is accumulated instead of assigned (as proposed by dyk in the comments), without using volatile:

result += ...

produces the same diagram as the original code.

asked Mar 10 '14 by tmaric



1 Answer

Inspecting the disassembly of your code compiled with -O3 -march=native -std=c++11 showed that the compiler was doing "too much" optimization: it detected the repeated assignment to the same unused variable. As suggested in the comments, I used += instead of =. I also initialized result = 0 and made main return result instead of 0, to make sure the compiler actually computes the value.
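
As a sketch, the relevant changes look like this (only the accumulation and the return differ; the timing code is unchanged):

double result = 0;        // initialized, so += is well-defined
// ...
for (auto d : test)
{
    result += d * d;      // accumulate instead of overwriting
}
// ... the same change in the other four loops ...
return result;            // returning result keeps the computation observable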

This modified code gives:

  • noFunction, staticPolicy and memberPolicy are lowered to mulsd, addsd, addsd, i.e. scalar SSE instructions. Clang also doesn't vectorize (with vanilla options), but Intel's icc does (it generates vector and non-vector versions and jumps between them depending on alignment and iteration count).
  • virtualMemberFunction and virtualMemberFunctionRaw result in a dynamic function call (no de-virtualization, and hence no inlining).

You can see for yourself by pasting your code here.
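
If you prefer to inspect the assembly locally, standard gcc tooling is enough (a sketch; the file names are just the ones from this post):

g++ -std=c++11 -O3 -march=native -S -fverbose-asm main.cpp -o main.s

or, to disassemble the optimized binary with demangled C++ names:

g++ -std=c++11 -O3 -march=native main.cpp -o main
objdump -d -C main | less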

To answer your Q1 ("pointer vs unique_ptr in a debug build"): at -O0, calls are not inlined; in particular, unique_ptr::operator-> is emitted as an explicit function call, so that's 2 function calls per iteration instead of 1 for the raw pointer. This difference disappears in optimized builds.
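
In other words, at -O0 the unique_ptr call is compiled roughly as if it were written like this (a sketch of what the generated code does, not a suggested change):

// call 1: unique_ptr<Interface>::operator->() returns the raw pointer
Interface* raw = sUptr.operator->();
// call 2: the virtual dispatch through the vtable
result = raw->calculate(d);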

To answer your Q2 ("is it possible to inline virtual calls"): in this example, gcc and clang don't inline the call, probably because they don't do enough static analysis. But you can help them. For instance, with clang 3.3 (but not 3.2, and not gcc) declaring the method as const and __attribute__((pure)) does the job. In gcc (4.8, pre-4.9) I tried marking the method as final and compiling with -fwhole-program, but this didn't remove the call. So yes, in this specific case it is possible to de-virtualize, but not reliably. In general, JIT compilers (C#, Java) are better at de-virtualizing because they can make better assumptions from runtime information.
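
For illustration, the variant that worked with clang 3.3 looked roughly like this (a sketch; __attribute__((pure)) is a GCC/Clang extension, and whether it removes the call is compiler- and version-specific):

class Interface
{
    public:
        // const + pure: the result depends only on the arguments and has no
        // side effects, which is what lets clang 3.3 optimize the call away
        virtual double calculate(double d) const __attribute__((pure)) = 0;
        virtual ~Interface() = default;
};

The override in Square must then be marked const as well.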

answered by Antoine