Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Effective optimization strategies on modern C++ compilers

I'm working on scientific code that is very performance-critical. An initial version of the code has been written and tested, and now, with profiler in hand, it's time to start shaving cycles from the hot spots.

It's well-known that some optimizations, e.g. loop unrolling, are handled these days much more effectively by the compiler than by a programmer meddling by hand. Which techniques are still worthwhile? Obviously, I'll run everything I try through a profiler, but if there's conventional wisdom as to what tends to work and what doesn't, it would save me significant time.

I know that optimization is very compiler- and architecture- dependent. I'm using Intel's C++ compiler targeting the Core 2 Duo, but I'm also interested in what works well for gcc, or for "any modern compiler."

Here are some concrete ideas I'm considering:

  • Is there any benefit to replacing STL containers/algorithms with hand-rolled ones? In particular, my program includes a very large priority queue (currently a std::priority_queue) whose manipulation is taking a lot of total time. Is this something worth looking into, or is the STL implementation already likely the fastest possible?
  • Along similar lines, for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?
  • I've found that dynamic memory allocation is often a severe bottleneck, and that eliminating it can lead to significant speedups. As a consequence I'm interesting in the performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference. Is there a way to reliably determine whether or not the compiler will use RVO for a given method (assuming the caller doesn't need to modify the result, of course)?
  • How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
  • Given the scientific nature of the program, floating-point numbers are used everywhere. A significant bottleneck in my code used to be conversions from floating point to integers: the compiler would emit code to save the current rounding mode, change it, perform the conversion, then restore the old rounding mode --- even though nothing in the program ever changed the rounding mode! Disabling this behavior significantly sped up my code. Are there any similar floating-point-related gotchas I should be aware of?
  • One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as move method calls like strlen() out of the termination conditions of loop. Are there any optimization like this one that I should look out for because they can't be done by the compiler and must be done by hand?
  • On the flip side, are there any techniques I should avoid because they are likely to interfere with the compiler's ability to automatically optimize code?

Lastly, to nip certain kinds of answers in the bud:

  • I understand that optimization has a cost in terms of complexity, reliability, and maintainability. For this particular application, increased performance is worth these costs.
  • I understand that the best optimizations are often to improve the high-level algorithms, and this has already been done.
like image 961
user168715 Avatar asked May 28 '10 21:05

user168715


People also ask

How are compilers optimized?

Compiler optimization is generally implemented using a sequence of optimizing transformations, algorithms which take a program and transform it to produce a semantically equivalent output program that uses fewer resources or executes faster.

What are the optimization strategies?

Optimization strategy is used to search and optimize the transform parameters for the maximization of the similarity value between the warped source image and the target image.

What is code compiler optimization techniques?

The code optimization in the synthesis phase is a program transformation technique, which tries to improve the intermediate code by making it consume fewer resources (i.e. CPU, Memory) so that faster-running machine code will result.


2 Answers

Is there any benefit to replacing STL containers/algorithms with hand-rolled ones? In particular, my program includes a very large priority queue (currently a std::priority_queue) whose manipulation is taking a lot of total time. Is this something worth looking into, or is the STL implementation already likely the fastest possible?

I assume you're aware that the STL containers rely on copying the elements. In certain cases, this can be a significant loss. Store pointers and you may see an increase in performance if you do a lot of container manipulation. On the other hand, it may reduce cache locality and hurt you. Another option is to use specialized allocators.

Certain containers (e.g. map, set, list) rely on lots of pointer manipulation. Although counterintuitive, it can often lead to faster code to replace them with vector. The resulting algorithm might go from O(1) or O(log n) to O(n), but due to cache locality it can be much faster in practice. Profile to be sure.

You mentioned you're using priority_queue, which I would imagine pays a lot for rearranging the elements, especially if they're large. You can try switching the underlying container (maybe deque or specialized). I'd almost certainly store pointers - again, profile to be sure.

Along similar lines, for a std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?

Again, this may help a small amount, depending on the use case. You can avoid the heap allocation, but only if you don't need your array to outlive the stack... or you could reserve() the size in the vector so there is less copying on reallocation.

I've found that dynamic memory allocation is often a severe bottleneck, and that eliminating it can lead to significant speedups. As a consequence I'm interesting in the performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference. Is there a way to reliably determine whether or not the compiler will use RVO for a given method (assuming the caller doesn't need to modify the result, of course)?

You could look at the generated assembly to see if RVO is applied, but if you return pointer or reference, you can be sure there's no copy. Whether this will help is dependent on what you're doing - e.g. can't return references to temporaries. You can use arenas to allocate and reuse objects, so not to pay a large heap penalty.

How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?

I've seen dramatic (seriously dramatic) speedups in this realm. I saw more improvements from this than I later saw from multithreading my code. Things may have changed in the five years since - only one way to be sure - profile.

On the flip side, are there any techniques I should avoid because they are likely to interfere with the compiler's ability to automatically optimize code?

  • Use explicit on your single argument constructors. Temporary object construction and destruction may be hidden in your code.

  • Be aware of hidden copy constructor calls on large objects. In some cases, consider replacing with pointers.

  • Profile, profile, profile. Tune areas that are bottlenecks.

like image 95
Stephen Avatar answered Sep 22 '22 04:09

Stephen


Take a look at the excellent Pitfalls of Object-Oriented Programming slides for some info about restructuring code for locality. In my experience getting better locality is almost always the biggest win.

General process:

  • Learn to love the Disassembly View in your debugger, or have your build system generate the intermediate assembly files (.s) if at all possible. Keep an eye on changes or for things that look egregious -- even without familiarity with a given instruction set architecture, you should be able to see some things fairly clearly! (I sometimes check in a series of .s files with corresponding .cpp/.c changes, just to leverage the lovely tools from my SCM to watch the code and corresponding asm change over time.)
  • Get a profiler that can watch your CPU's performance counters, or can at least guess at cache misses. (AMD CodeAnalyst, cachegrind, vTune, etc.)

Some other specific things:

  • Understand strict aliasing. Once you do, make use of restrict if your compiler has it. (Examine the disasm here too!)
  • Check out different floating point modes on your processor and compiler. If you don't need the denormalized range, choosing a mode without this can result in better performance. (It sounds like you've already done some things in this area, based on your discussion of rounding modes.)
  • Definitely avoid allocs: call reserve on std::vector when you can, or use std::array when you know the size at compile-time.
  • Use memory pools to increase locality and decrease alloc/free overhead; also to ensure cacheline alignment and prevent ping-ponging.
  • Use frame allocators if you're allocating things in predictable patterns, and can afford to deallocate everything in one go.
  • Do be aware of invariants. Something you know is invariant may not be to the compiler, for example a use of a struct or class member in a loop. I find the single easiest way to fall into the correct habit here is to give a name to everything, and prefer to name things outside of loops. E.g. const int threshold = m_currentThreshold; or perhaps Thing * const pThing = pStructHoldingThing->pThing; Fortunately you can usually see things that need this treatment in the disassembly view. This also helps with debugging later (makes the watch/locals window behave much more nicely in debug builds)!
  • Avoid writes in loops if possible -- accumulate first, then write, or batch a few writes together. YMMV, of course.

WRT your std::priority_queue question: inserting things into a vector (the default backend for a priority_queue) tends to move a lot of elements around. If you can break up into phases, where you insert data, then sort it, then read it once it's sorted, you'll probably be a lot better off. Although you'll definitely lose locality, you may find a more self-ordering structure like a std::map or std::set worth the overhead -- but this is really dependent on your usage patterns.

like image 29
leander Avatar answered Sep 22 '22 04:09

leander