I wanted to measure and compare the overhead of different kinds of function calls, in the sense of two alternative ways of extending a class while minimizing code modification:
- dynamic polymorphism: a virtual member function called through a pointer to a base-class interface;
- static polymorphism: a policy class whose calculate function is brought in through a template parameter.
Both those options are compared with calling no function at all. I am also aware of the NVI idiom usually used when designing classes that support dynamic polymorphism - the example that I used was just a benchmark for the overhead.
Here is the code that I tried to use for this purpose:
#include <iostream>
#include <vector>
#include <chrono>
#include <ctime>
#include <memory>

// Dynamic polymorphism: an abstract interface with a virtual call.
class Interface
{
public:
    virtual double calculate(double t) = 0;
    virtual ~Interface() = default;
};

class Square
:
    public Interface
{
public:
    double calculate(double d)
    {
        return d*d;
    }
};

// Static polymorphism: policy classes plugged in as template parameters.
class SquareStaticFunction
{
public:
    static double calculate(double d)
    {
        return d*d;
    }
};

class SquareMemberFunction
{
public:
    double calculate(double d)
    {
        return d*d;
    }
};

template<typename Function>
class Generic
:
    public Function
{
public:
    using Function::calculate;
};

using namespace std;

int main(int argc, const char *argv[])
{
    // One million elements, all set to 5.
    vector<double> test(1e06, 5);

    unique_ptr<Interface> sUptr(new Square());
    Interface* sPtr = new Square();

    Generic<SquareStaticFunction> gStatic;
    Generic<SquareMemberFunction> gMember;

    double result;

    typedef std::chrono::high_resolution_clock Clock;

    // Baseline: no function call, plain multiplication.
    auto start = Clock::now();
    for (auto d : test)
    {
        result = d * d;
    }
    auto end = Clock::now();
    auto noFunction = end - start;

    // Virtual member function called through a unique_ptr.
    start = Clock::now();
    for (auto d : test)
    {
        result = sUptr->calculate(d);
    }
    end = Clock::now();
    auto virtualMemberFunction = end - start;

    // Virtual member function called through a raw pointer.
    start = Clock::now();
    for (auto d : test)
    {
        result = sPtr->calculate(d);
    }
    end = Clock::now();
    auto virtualMemberFunctionRaw = end - start;

    // Static function policy.
    start = Clock::now();
    for (auto d : test)
    {
        result = gStatic.calculate(d);
    }
    end = Clock::now();
    auto staticPolicy = end - start;

    // Member function policy.
    start = Clock::now();
    for (auto d : test)
    {
        result = gMember.calculate(d);
    }
    end = Clock::now();
    auto memberPolicy = end - start;

    cout << noFunction.count() << " " << virtualMemberFunction.count()
         << " " << virtualMemberFunctionRaw.count()
         << " " << staticPolicy.count()
         << " " << memberPolicy.count() << endl;

    delete sPtr;
    sPtr = nullptr;

    return 0;
}
I compiled the code with gcc 4.8.2 on a Linux x86_64 machine with the following CPU model: Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz.
The virtual member function is accessed in one test via a raw pointer and in another via a unique_ptr. First I compiled the code without any optimizations:
g++ -std=c++11 main.cpp -o main
and ran 1000 tests with the following shell command:
for i in {1..1000}; do ./main >> results; done
I plotted the results file using the following gnuplot script (note the logarithmic y-axis):
set terminal png size 1600,800
set logscale y
set key out vert right top
set out 'results.png'
plot 'results' using 0:1 title "no function" , \
'results' using 0:2 title "virtual member function (unique ptr)", \
'results' using 0:3 title "virtual member function (raw ptr)", \
'results' using 0:4 title "static policy", \
'results' using 0:5 title "member function policy"
For the non-optimized code, the diagram looks like this:
Q1: Does the call to the virtual function via the unique_ptr end up being the most expensive one because it involves an extra indirection when dereferencing the pointer to the managed object?
Then I turned on optimization and compiled the code with:
g++ -std=c++11 -O3 main.cpp -o main
which resulted in the following diagram:
Q2: Are the virtual member functions the most costly in this case because, when they are accessed via a base-class pointer or reference (so vtable dispatch is used), the compiler cannot inline them?
Q3: This question is what made me post all this: how is it possible in the optimized diagram that the static and member policies end up being faster than the plain multiplication code in this simple example?
Edit: making result volatile and compiling with optimizations turned on makes the run times of the policies much larger, but similar to those of the raw multiplication code:
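In code, that change only touches the declaration of result (a minimal sketch, shown here for the baseline loop only, not the exact file used for the plots):

#include <vector>

int main()
{
    std::vector<double> test(1000000, 5.0);

    // volatile forces every store to result to actually happen,
    // so the loop bodies can no longer be optimized away entirely
    volatile double result;

    for (auto d : test)
    {
        result = d * d;
    }

    return 0;
}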
Edit: modifying the code so that the result is accumulated instead of assigned (as proposed by dyk in the comments), without using volatile, i.e. result += ..., results in the same diagram as for the original code.
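In code, that edit is just the following (a minimal sketch; result is still never read afterwards):

#include <vector>

int main()
{
    std::vector<double> test(1000000, 5.0);

    double result = 0;

    for (auto d : test)
    {
        // accumulate instead of overwriting; since result is still unused,
        // an optimizing compiler may again remove the whole loop
        result += d * d;
    }

    return 0;
}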
In practice, when we talk about (and measure) function call overhead today, we are usually really talking about the overhead of not being able to hoist common subexpressions out of loops.
This overhead increases by about another 2 ns per call (total call time about 6 ns) for functions called through the PLT (functions in a shared library).
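To illustrate that point with a standalone sketch (my own example, assuming square_opaque lives in a separate translation unit that the compiler cannot see into):

#include <cstddef>

double square_opaque(double x);   // only a declaration: the optimizer cannot
                                  // inspect the body, let alone inline it

inline double square_inline(double x) { return x * x; }

double sum_inlined(const double* p, std::size_t n, double k)
{
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
    {
        // after inlining, k * k is loop-invariant and can be hoisted
        // out of the loop, so the "call" costs nothing per iteration
        s += p[i] * square_inline(k);
    }
    return s;
}

double sum_opaque(const double* p, std::size_t n, double k)
{
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
    {
        // the compiler must assume square_opaque could have side effects,
        // so the call is re-issued on every iteration
        s += p[i] * square_opaque(k);
    }
    return s;
}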
Syscalls incur quite a heavy overhead compared to normal function calls, though this has improved over time. The vDSO implementation of the getpid() system call is quite good at mitigating the system call overhead and is almost as fast as a normal function call.
The benchmarks execute the function call or syscall 10 million times in a loop. Each benchmark is run 10 times, and the average of those runs is divided by 10 million to get the time per function or system call. The function call benchmark is compiled with GCC (I also wrote it by hand in assembler, but the result was almost the same).
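A minimal sketch of that kind of benchmark (a reconstruction from the description above, not the exact code that produced those numbers):

#include <chrono>
#include <cstdio>
#include <sys/syscall.h>
#include <unistd.h>

// Times `iters` invocations of f and returns the average nanoseconds per call.
template <typename F>
double ns_per_call(F f, long iters)
{
    auto start = std::chrono::high_resolution_clock::now();
    for (long i = 0; i < iters; ++i)
    {
        f();
    }
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::nano>(end - start).count() / iters;
}

int main()
{
    const long iters = 10000000;  // 10 million calls, as described above

    // getpid() through the C library wrapper
    double libc_ns = ns_per_call([] { getpid(); }, iters);

    // getpid() forced through the raw system call interface
    double raw_ns = ns_per_call([] { syscall(SYS_getpid); }, iters);

    std::printf("libc getpid: %.1f ns/call, raw syscall: %.1f ns/call\n",
                libc_ns, raw_ns);
    return 0;
}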
Looking at the disassembly of your code built with -O3 -march=native -std=c++11 showed that the compiler was doing "too much" optimization by detecting the repeated assignment to the same unused variable. As suggested in the comments, I used += instead of =. I also initialized result to 0 and made main return result instead of 0, to make sure the compiler actually computes its value. This modified code gives:
- noFunction, staticPolicy and memberPolicy are lowered to mulsd, addsd, addsd, i.e. scalar SSE instructions. Clang also doesn't vectorize (with vanilla options), but Intel's icc does (it generates vector and non-vector versions and jumps between them depending on alignment and iteration count).
- virtualMemberFunction and virtualMemberFunctionRaw result in a dynamic function call (no de-virtualization, no inlining).

You can see for yourself by pasting your code here.
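For concreteness, here is a minimal sketch of those modifications (a reconstruction, not the exact code behind the numbers above); only the baseline loop is shown, the other four follow the same pattern:

#include <iostream>
#include <vector>
#include <chrono>

int main()
{
    std::vector<double> test(1000000, 5.0);

    // accumulate instead of overwriting, and start from a known value
    double result = 0;

    auto start = std::chrono::high_resolution_clock::now();
    for (auto d : test)
    {
        result += d * d;
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto noFunction = end - start;

    std::cout << noFunction.count() << "\n";

    // ... the other four timing loops, modified the same way ...

    // returning the accumulated value makes the computation observable,
    // so the optimizer cannot discard the loops as dead code
    return static_cast<int>(result);
}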
To answer your Q1 ("pointer vs unique_ptr in a debug build"): at -O0 calls are not inlined automatically; in particular unique_ptr::operator-> is called explicitly with no inlining, so that's two function calls per iteration instead of one for the regular pointer. This difference disappears in optimized builds.
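Spelled out with the names from the code above (a sketch of what the compiler sees before any inlining):

// At -O0 the unique_ptr access is two real function calls per iteration:
result = sUptr.operator->()->calculate(d);  // call 1: unique_ptr::operator->()
                                            // call 2: the virtual calculate()

// The raw pointer access is a single (virtual) call:
result = sPtr->calculate(d);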
To answer your Q2 ("is it possible to inline virtual calls"): in this example, gcc and clang don't inline the call, probably because they don't do enough static analysis. But you can help them. For instance, with clang 3.3 (but not 3.2, and not gcc), declaring the method as const and __attribute((pure)) does the job. With gcc (4.8, pre-4.9) I tried marking the method as final and compiling with -fwhole-program, but this didn't remove the call. So yes, in this specific case it is possible to de-virtualize, but not reliably. In general, JIT compilers (C#, Java) are better at de-virtualizing because they can make better assumptions from runtime information.
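A sketch of those hints applied to the classes above (the base-class signature has to change as well so that the const override still matches; whether a given compiler actually de-virtualizes is version-dependent, as noted):

// const plus __attribute__((pure)) promise that the call has no side effects;
// final rules out further overrides. These are hints only: clang 3.3 reportedly
// de-virtualizes with const + pure, while gcc 4.8 did not, even with final
// and -fwhole-program.
class Interface
{
public:
    __attribute__((pure)) virtual double calculate(double t) const = 0;
    virtual ~Interface() = default;
};

class Square final : public Interface
{
public:
    __attribute__((pure)) double calculate(double d) const override
    {
        return d*d;
    }
};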