I'm working on scientific code that is very performance-critical. An initial version of the code has been written and tested, and now, with profiler in hand, it's time to start shaving cycles from the hot spots.
It's well-known that some optimizations, e.g. loop unrolling, are handled these days much more effectively by the compiler than by a programmer meddling by hand. Which techniques are still worthwhile? Obviously, I'll run everything I try through a profiler, but if there's conventional wisdom as to what tends to work and what doesn't, it would save me significant time.
I know that optimization is very compiler- and architecture-dependent. I'm using Intel's C++ compiler targeting the Core 2 Duo, but I'm also interested in what works well for gcc, or for "any modern compiler."
Here are some concrete ideas I'm considering:

- Is there any benefit to replacing STL containers/algorithms with hand-rolled ones? In particular, my program includes a very large priority queue (currently a std::priority_queue) whose manipulation is taking a lot of total time. Is this something worth looking into, or is the STL implementation already likely the fastest possible?
- Along similar lines, for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?

Lastly, to nip certain kinds of answers in the bud:
Is there any benefit to replacing STL containers/algorithms with hand-rolled ones? In particular, my program includes a very large priority queue (currently a std::priority_queue) whose manipulation is taking a lot of total time. Is this something worth looking into, or is the STL implementation already likely the fastest possible?
I assume you're aware that the STL containers rely on copying the elements. In certain cases, this can be a significant loss. Store pointers and you may see an increase in performance if you do a lot of container manipulation. On the other hand, it may reduce cache locality and hurt you. Another option is to use specialized allocators.
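For illustration, here is a minimal sketch of that pointer-vs-value tradeoff; the Item type and its size are invented for the example:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical large element type -- stand-in for whatever the real code stores.
struct Item {
    double payload[64];   // ~512 bytes: copying/moving this around is not free
};

// Storing values: every insertion, erase, or internal rearrangement moves whole Items,
// but iteration walks contiguous memory (good cache locality).
std::vector<Item> by_value;

// Storing smart pointers: the container only shuffles pointer-sized handles,
// at the cost of a heap allocation per element and an extra indirection on access.
std::vector<std::unique_ptr<Item>> by_pointer;

void fill(std::size_t n) {
    by_value.reserve(n);
    by_pointer.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        by_value.push_back(Item{});
        by_pointer.push_back(std::make_unique<Item>());
    }
}
```

Which variant wins depends on how often elements are rearranged versus traversed - profile both.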
Certain containers (e.g. map, set, list) rely on lots of pointer manipulation. Although counterintuitive, it can often lead to faster code to replace them with vector. The resulting algorithm might go from O(1) or O(log n) to O(n), but due to cache locality it can be much faster in practice. Profile to be sure.
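As a concrete (hypothetical) example of that swap, a lookup-heavy std::set can often be replaced by a sorted std::vector plus binary search:

```cpp
#include <algorithm>
#include <set>
#include <vector>

// Node-based tree: O(log n) lookups, but every step chases a pointer,
// so each comparison is a likely cache miss.
bool contains_tree(const std::set<int>& s, int key) {
    return s.count(key) != 0;
}

// Sorted vector: same O(log n) asymptotics, but the binary search walks
// contiguous memory, which is often much faster in practice.
bool contains_sorted(const std::vector<int>& v, int key) {
    return std::binary_search(v.begin(), v.end(), key);
}

// Build once (or rebuild rarely), then query many times.
std::vector<int> make_sorted(std::vector<int> values) {
    std::sort(values.begin(), values.end());
    values.erase(std::unique(values.begin(), values.end()), values.end());
    return values;
}
```

This pays off when lookups vastly outnumber insertions; if the container churns constantly, the tree may still win.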
You mentioned you're using priority_queue, which I would imagine pays a lot for rearranging the elements, especially if they're large. You can try switching the underlying container (maybe deque or something more specialized). I'd almost certainly store pointers - again, profile to be sure.
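A sketch of the pointer-storing variant, with a made-up Event type; the comparator dereferences, so heap operations only ever move pointers:

```cpp
#include <queue>
#include <vector>

// Hypothetical large queue element.
struct Event {
    double when;
    char payload[120];
};

// Order by the pointed-to object so heap rearrangement shuffles 8-byte pointers,
// not whole Events.
struct EventPtrLess {
    bool operator()(const Event* a, const Event* b) const {
        return a->when < b->when;   // largest 'when' on top; flip the comparison for a min-heap
    }
};

// The underlying container is still a std::vector of pointers; a std::deque could be
// substituted as the middle template argument if it profiles better.
using EventQueue = std::priority_queue<Event*, std::vector<Event*>, EventPtrLess>;
```

The price of the indirection is that ownership of the pointed-to Events has to be managed separately (e.g. a pool or arena).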
Along similar lines, for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?
Again, this may help a small amount, depending on the use case. You can avoid the heap allocation, but only if you don't need your array to outlive the stack frame... or you could reserve() the size in the vector so there is less copying on reallocation.
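Both options in sketch form, assuming a small, known upper bound (kUpperBound is invented for the example):

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kUpperBound = 64;   // assumed small and known in advance

void with_reserve(std::size_t n) {
    std::vector<double> v;
    v.reserve(kUpperBound);        // one up-front allocation; no reallocation or copying later
    for (std::size_t i = 0; i < n; ++i)
        v.push_back(0.5 * i);
}

void with_stack_buffer(std::size_t n) {
    double buf[kUpperBound];       // no heap traffic at all, but must not outlive this frame
    std::size_t count = 0;
    for (std::size_t i = 0; i < n && count < kUpperBound; ++i)
        buf[count++] = 0.5 * i;
    (void)buf;                     // placeholder for whatever consumes the buffer
}
```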
I've found that dynamic memory allocation is often a severe bottleneck, and that eliminating it can lead to significant speedups. As a consequence I'm interested in the performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference. Is there a way to reliably determine whether or not the compiler will use RVO for a given method (assuming the caller doesn't need to modify the result, of course)?
You could look at the generated assembly to see if RVO is applied, but if you return a pointer or reference, you can be sure there's no copy. Whether this will help depends on what you're doing - e.g. you can't return references to temporaries. You can use arenas to allocate and reuse objects, so as not to pay a large heap penalty.
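A sketch of the three shapes the question mentions (return of a prvalue, named return value, and an output parameter that doubles as a reusable buffer):

```cpp
#include <cstddef>
#include <vector>

// Returning a prvalue: the object is constructed directly in the caller's storage
// (copy elision; guaranteed by the standard since C++17).
std::vector<double> make_grid(std::size_t n) {
    return std::vector<double>(n, 0.0);
}

// Named return value (NRVO): usually elided as well, but not guaranteed --
// check the generated assembly if it matters.
std::vector<double> make_grid_named(std::size_t n) {
    std::vector<double> g(n, 0.0);
    // ... fill g ...
    return g;
}

// Passing the result in by reference sidesteps the question and lets the caller
// reuse one buffer's capacity across many calls (a poor man's arena).
void fill_grid(std::size_t n, std::vector<double>& out) {
    out.clear();           // keeps the existing capacity
    out.resize(n, 0.0);
}
```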
How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
I've seen dramatic (seriously dramatic) speedups in this realm. I saw more improvements from this than I later saw from multithreading my code. Things may have changed in the five years since - the only way to be sure is to profile.
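The classic example of a cache-driven loop interchange, in case it's useful: traversing a row-major array column-by-column versus row-by-row.

```cpp
#include <cstddef>

constexpr std::size_t N = 1024;

// Column-major traversal of a row-major array: consecutive accesses are
// N * sizeof(double) bytes apart, so nearly every load misses the cache.
double sum_columns_first(const double (&a)[N][N]) {
    double total = 0.0;
    for (std::size_t j = 0; j < N; ++j)
        for (std::size_t i = 0; i < N; ++i)
            total += a[i][j];
    return total;
}

// Swapping the loops walks memory contiguously: same arithmetic, far fewer misses.
double sum_rows_first(const double (&a)[N][N]) {
    double total = 0.0;
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            total += a[i][j];
    return total;
}
```

Compilers can sometimes interchange loops themselves, but it's fragile; checking the generated code (or simply timing both orders) is worth the few minutes.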
On the flip side, are there any techniques I should avoid because they are likely to interfere with the compiler's ability to automatically optimize code?
Use explicit on your single-argument constructors. Temporary object construction and destruction may otherwise be hidden in your code.
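A small sketch of what explicit buys you (Buffer and consume are invented names):

```cpp
#include <cstddef>
#include <vector>

struct Buffer {
    explicit Buffer(std::size_t n) : data(n) {}   // 'explicit' blocks silent conversions
    std::vector<double> data;
};

void consume(const Buffer&) {}

void demo() {
    // consume(128);        // without 'explicit' this compiles and builds a hidden temporary Buffer
    consume(Buffer(128));   // with 'explicit' the temporary has to be spelled out at the call site
}
```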
Be aware of hidden copy constructor calls on large objects. In some cases, consider replacing with pointers.
Profile, profile, profile. Tune areas that are bottlenecks.
Take a look at the excellent Pitfalls of Object-Oriented Programming slides for some info about restructuring code for locality. In my experience getting better locality is almost always the biggest win.
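One common form of that restructuring is switching from array-of-structs to struct-of-arrays; a minimal sketch of the idea (the particle fields are invented):

```cpp
#include <cstddef>
#include <vector>

// Array-of-structs: updating only positions still drags every particle's
// cold fields (mass, charge) through the cache.
struct Particle {
    float x, y, z;
    float vx, vy, vz;
    float mass, charge;
};

void advance_aos(std::vector<Particle>& ps, float dt) {
    for (Particle& p : ps) {
        p.x += p.vx * dt;
        p.y += p.vy * dt;
        p.z += p.vz * dt;
    }
}

// Struct-of-arrays: the hot fields are packed contiguously, so the same loop
// touches far less memory per particle.
struct Particles {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
    std::vector<float> mass, charge;
};

void advance_soa(Particles& ps, float dt) {
    for (std::size_t i = 0; i < ps.x.size(); ++i) {
        ps.x[i] += ps.vx[i] * dt;
        ps.y[i] += ps.vy[i] * dt;
        ps.z[i] += ps.vz[i] * dt;
    }
}
```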
General process:
Some other specific things:
- Use restrict if your compiler has it. (Examine the disasm here too!)
- reserve on std::vector when you can, or use std::array when you know the size at compile-time.
- Hoist values that get re-loaded through pointers or members inside hot loops into const locals, e.g. const int threshold = m_currentThreshold; or perhaps Thing * const pThing = pStructHoldingThing->pThing;. Fortunately you can usually see things that need this treatment in the disassembly view. This also helps with debugging later (makes the watch/locals window behave much more nicely in debug builds)!

WRT your std::priority_queue question: inserting things into a vector (the default backend for a priority_queue) tends to move a lot of elements around. If you can break up into phases, where you insert data, then sort it, then read it once it's sorted, you'll probably be a lot better off (see the sketch below). Although you'll definitely lose locality, you may find a more self-ordering structure like a std::map or std::set worth the overhead -- but this is really dependent on your usage patterns.
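A sketch of that insert/sort/read split, with an invented Event type:

```cpp
#include <algorithm>
#include <vector>

struct Event { double when; int id; };

// Phase 1: append everything with cheap push_backs (no per-insert heap maintenance).
// Phase 2: one O(n log n) sort.
// Phase 3: consume in priority order by walking the sorted, contiguous vector.
void process_in_phases(std::vector<Event>& events) {
    std::sort(events.begin(), events.end(),
              [](const Event& a, const Event& b) { return a.when < b.when; });
    for (const Event& e : events) {
        // handle(e);       // placeholder for the real per-event work
        (void)e;
    }
}
```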