I read that using a policy class for a function that will be called in a tight loop is much faster than using a polymorphic function. However, I setup this demo and the timing indicates that it is exactly the opposite!? The policy version takes between 2-3x longer than the polymorphic version.
#include <iostream>
#include <boost/timer.hpp>
// Policy version
template < typename operation_policy>
class DoOperationPolicy : public operation_policy
{
using operation_policy::Operation;
public:
void Run(const float a, const float b)
{
Operation(a,b);
}
};
class OperationPolicy_Add
{
protected:
float Operation(const float a, const float b)
{
return a + b;
}
};
// Polymorphic version
class DoOperation
{
public:
virtual float Run(const float a, const float b)= 0;
};
class OperationAdd : public DoOperation
{
public:
float Run(const float a, const float b)
{
return a + b;
}
};
int main()
{
boost::timer timer;
unsigned int numberOfIterations = 1e7;
DoOperationPolicy<OperationPolicy_Add> policy_operation;
for(unsigned int i = 0; i < numberOfIterations; ++i)
{
policy_operation.Run(1,2);
}
std::cout << timer.elapsed() << " seconds." << std::endl;
timer.restart();
DoOperation* polymorphic_operation = new OperationAdd;
for(unsigned int i = 0; i < numberOfIterations; ++i)
{
polymorphic_operation->Run(1,2);
}
std::cout << timer.elapsed() << " seconds." << std::endl;
}
Is there something wrong with the demo? Or is just incorrect that the policy should be faster?
Your benchmark is meaningless (sorry).
Making real benchmarks is hard, unfortunately, as compilers are very clever.
Things to look for here:
polymorphic_operation
is necessarily a OperationAdd
and thus directly call OperationAdd::Run
without invoking runtime dispatchIndeed, your entire benchmark code can be optimized to:
int main()
{
boost::timer timer;
std::cout << timer.elapsed() << " seconds." << std::endl;
timer.restart();
DoOperation* polymorphic_operation = new OperationAdd;
std::cout << timer.elapsed() << " seconds." << std::endl;
}
Which is when you realize that you are not timing what you'd like to...
In order to make your benchmark meaningful you need to:
To prevent devirtualization, just declare a DoOperation& Get()
function, and then in another cpp file: DoOperation& Get() { static OperationAdd O; return O; }
.
To force side-effects (only necessary if the methods are inlined): return the value and accumulate it, then display it.
In action using this program:
// test2.cpp
namespace so8746025 {
class DoOperation
{
public:
virtual float Run(const float a, const float b) = 0;
};
class OperationAdd : public DoOperation
{
public:
float Run(const float a, const float b)
{
return a + b;
}
};
class OperationAddOutOfLine: public DoOperation
{
public:
float Run(const float a, const float b);
};
float OperationAddOutOfLine::Run(const float a, const float b)
{
return a + b;
}
DoOperation& GetInline() {
static OperationAdd O;
return O;
}
DoOperation& GetOutOfLine() {
static OperationAddOutOfLine O;
return O;
}
} // namespace so8746025
// test.cpp
#include <iostream>
#include <boost/timer.hpp>
namespace so8746025 {
// Policy version
template < typename operation_policy>
struct DoOperationPolicy
{
float Run(const float a, const float b)
{
return operation_policy::Operation(a,b);
}
};
struct OperationPolicy_Add
{
static float Operation(const float a, const float b)
{
return a + b;
}
};
// Polymorphic version
class DoOperation
{
public:
virtual float Run(const float a, const float b) = 0;
};
class OperationAdd : public DoOperation
{
public:
float Run(const float a, const float b)
{
return a + b;
}
};
class OperationAddOutOfLine: public DoOperation
{
public:
float Run(const float a, const float b);
};
DoOperation& GetInline();
DoOperation& GetOutOfLine();
} // namespace so8746025
using namespace so8746025;
int main()
{
unsigned int numberOfIterations = 1e8;
DoOperationPolicy<OperationPolicy_Add> policy;
OperationAdd stackInline;
DoOperation& virtualInline = GetInline();
OperationAddOutOfLine stackOutOfLine;
DoOperation& virtualOutOfLine = GetOutOfLine();
boost::timer timer;
float result = 0;
for(unsigned int i = 0; i < numberOfIterations; ++i) {
result += policy.Run(1,2);
}
std::cout << "Policy: " << timer.elapsed() << " seconds (" << result << ")" << std::endl;
timer.restart();
result = 0;
for(unsigned int i = 0; i < numberOfIterations; ++i)
{
result += stackInline.Run(1,2);
}
std::cout << "Stack Inline: " << timer.elapsed() << " seconds (" << result << ")" << std::endl;
timer.restart();
result = 0;
for(unsigned int i = 0; i < numberOfIterations; ++i)
{
result += virtualInline.Run(1,2);
}
std::cout << "Virtual Inline: " << timer.elapsed() << " seconds (" << result << ")" << std::endl;
timer.restart();
result = 0;
for(unsigned int i = 0; i < numberOfIterations; ++i)
{
result += stackOutOfLine.Run(1,2);
}
std::cout << "Stack Out Of Line: " << timer.elapsed() << " seconds (" << result << ")" << std::endl;
timer.restart();
result = 0;
for(unsigned int i = 0; i < numberOfIterations; ++i)
{
result += virtualOutOfLine.Run(1,2);
}
std::cout << "Virtual Out Of Line: " << timer.elapsed() << " seconds (" << result << ")" << std::endl;
}
We get:
$ gcc --version
gcc (GCC) 4.3.2
$ ./testR
Policy: 0.17 seconds (6.71089e+07)
Stack Inline: 0.17 seconds (6.71089e+07)
Virtual Inline: 0.52 seconds (6.71089e+07)
Stack Out Of Line: 0.6 seconds (6.71089e+07)
Virtual Out Of Line: 0.59 seconds (6.71089e+07)
Note the subtle difference between devirtualization + inline and the absence of devirtualization.
FWIW I made it
Timings are:
DoDirect: 3.4 seconds.
Policy: 3.41 seconds.
Polymorphic: 3.4 seconds.
Ergo: there is no difference. Mainly because GCC is able to statically analyze the type of DoOperation* to be DoOperationAdd - there is vtable lookup inside the loop :)
If you wanted to benchmark reallife performance of this exact loop, instead of function invocation overhead, drop the volatile. The timings now become
DoDirect: 6.71089e+07 in 1.12 seconds.
Policy: 6.71089e+07 in 1.15 seconds.
Polymorphic: 6.71089e+07 in 3.38 seconds.
As you can see, without volatile, the compiler is able to optimize some load-store cycles away; I assume it might be doing loop unrolling+register allocation there (however I haven't inspected the machine code). The point is, that the loop as a whole can be optimized much more with the 'policy' approach than with the dynamic dispatch (i.e. the virtual method)
#include <iostream>
#include <boost/timer.hpp>
// Direct version
struct DoDirect {
static float Run(const float a, const float b) { return a + b; }
};
// Policy version
template <typename operation_policy>
struct DoOperationPolicy {
float Run(const float a, const float b) const {
return operation_policy::Operation(a,b);
}
};
struct OperationPolicy_Add {
static float Operation(const float a, const float b) {
return a + b;
}
};
// Polymorphic version
struct DoOperation {
virtual float Run(const float a, const float b) const = 0;
};
struct OperationAdd : public DoOperation {
float Run(const float a, const float b) const { return a + b; }
};
int main(int argc, const char *argv[])
{
boost::timer timer;
const unsigned long numberOfIterations = 1<<30ul;
volatile float result = 0;
for(unsigned long i = 0; i < numberOfIterations; ++i) {
result += DoDirect::Run(1,2);
}
std::cout << "DoDirect: " << result << " in " << timer.elapsed() << " seconds." << std::endl;
timer.restart();
DoOperationPolicy<OperationPolicy_Add> policy_operation;
for(unsigned long i = 0; i < numberOfIterations; ++i) {
result += policy_operation.Run(1,2);
}
std::cout << "Policy: " << result << " in " << timer.elapsed() << " seconds." << std::endl;
timer.restart();
result = 0;
DoOperation* polymorphic_operation = new OperationAdd;
for(unsigned long i = 0; i < numberOfIterations; ++i) {
result += polymorphic_operation->Run(1,2);
}
std::cout << "Polymorphic: " << result << " in " << timer.elapsed() << " seconds." << std::endl;
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With