Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to correctly benchmark a [templated] C++ program

< backgound>

I'm at a point where I really need to optimize C++ code. I'm writing a library for molecular simulations and I need to add a new feature. I already tried to add this feature in the past, but I then used virtual functions called in nested loops. I had bad feelings about that and the first implementation proved that this was a bad idea. However this was OK for testing the concept.

< /background>

Now I need this feature to be as fast as possible (well without assembly code or GPU calculation, this still has to be C++ and more readable than less). Now I know a little bit more about templates and class policies (from Alexandrescu's excellent book) and I think that a compile-time code generation may be the solution.

However I need to test the design before doing the huge work of implementing it into the library. The question is about the best way to test the efficiency of this new feature.

Obviously I need to turn optimizations on because without this g++ (and probably other compilers as well) would keep some unnecessary operations in the object code. I also need to make a heavy use of the new feature in the benchmark because a delta of 1e-3 second can make the difference between a good and a bad design (this feature will be called million times in the real program).

The problem is that g++ is sometimes "too smart" while optimizing and can remove a whole loop if it consider that the result of a calculation is never used. I've already seen that once when looking at the output assembly code.

If I add some printing to stdout, the compiler will then be forced to do the calculation in the loop but I will probably mostly benchmark the iostream implementation.

So how can I do a correct benchmark of a little feature extracted from a library ? Related question: is it a correct approach to do this kind of in vitro tests on a small unit or do I need the whole context ?

Thanks for advices !


There seem to be several strategies, from compiler-specific options allowing fine tuning to more general solutions that should work with every compiler like volatile or extern.

I think I will try all of these. Thanks a lot for all your answers!

like image 476
ascobol Avatar asked Jan 12 '09 14:01

ascobol


1 Answers

If you want to force any compiler to not discard a result, have it write the result to a volatile object. That operation cannot be optimized out, by definition.

template<typename T> void sink(T const& t) {
   volatile T sinkhole = t;
}

No iostream overhead, just a copy that has to remain in the generated code. Now, if you're collecting results from a lot of operations, it's best not to discard them one by one. These copies can still add some overhead. Instead, somehow collect all results in a single non-volatile object (so all individual results are needed) and then assign that result object to a volatile. E.g. if your individual operations all produce strings, you can force evaluation by adding all char values together modulo 1<<32. This adds hardly any overhead; the strings will likely be in cache. The result of the addition will subsequently be assigned-to-volatile so each char in each sting must in fact be calculated, no shortcuts allowed.

like image 83
MSalters Avatar answered Sep 23 '22 22:09

MSalters