Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Loop optimisation techniques in C++

In order to increase the performance of our applications, we have to consider loop optimisation techniques during the development phase.

I'd like to show you some different ways to iterate over a simple std::vector<uint32_t> v:

  1. Unoptimized loop with index:

    uint64_t sum = 0;
    for (unsigned int i = 0; i < v.size(); i++)
        sum += v[i];
    
  2. Unoptimized loop with iterator:

    uint64_t sum = 0;
    std::vector<uint32_t>::const_iterator it;
    for (it = v.begin(); it != v.end(); it++)
        sum += *it;
    
  3. Cached std::vector::end iterator:

    uint64_t sum = 0;
    std::vector<uint32_t>::const_iterator it, end(v.end());
    for (it = v.begin(); it != end; it++)
        sum += *it;
    
  4. Pre-increment iterators:

    uint64_t sum = 0;
    std::vector<uint32_t>::const_iterator it, end(v.end());
    for (it = v.begin(); it != end; ++it)
        sum += *it;
    
  5. Range-based loop:

    uint64_t sum = 0;
    for (auto const &x : v)
        sum += x;
    

There are also other means to build a loop in C++; for instance by using std::for_each, BOOST_FOREACH, etc...

In your opinion, which is the best approach to increase the performance and why?

Furthermore, in performance-critical applications it could be useful to unroll the loops: again, which approach would you suggest?

like image 273
vdenotaris Avatar asked Aug 13 '13 15:08

vdenotaris


1 Answers

There's no hard and fast rule, since it depends on the implementation. If the measures I did some years back are typical, however: about the only thing which makes a difference is caching the end iterator. Pre- or post-fix makes no difference, regardless of the container and iterator type.

At the time, I didn't measure indexing (because I was comparing iterators of different types of container as well, and not all support indexing). But I would guess that if you use indexes, you should cache the results of v.size() as well.

Of course, these measures were for one compiler (g++) on one system, with a specific hardware. The only way you can know for your environment is to measure yourself.

RE your note: are you sure you have full optimization turned on. My measures showed no difference between 3 and 4, and I doubt that commpilers optimize less today.

It's very important for the optimizations here that the functions are actually inlined. If they're not, post-incrementation does require some extra copying, and typically will require an extra function call (to the copy constructor of the iterator) as well. Once the functions are inlined, however, the compiler can easily see that all this is a unessential, and (at least when I tried it) generate exactly the same code in both cases. (I'd use pre-incrementation anyway. Not because it makes a difference, but because if you don't, some idiots will come along claiming it will, despite your measures. Or maybe they're not idiots, but are just using a particularly stupid compiler.)

To tell the truth, when I did the measurements, I was surprised that caching the end iterator made a difference, even for vector, where as there was no difference between pre- and post-incrementation, even for a reverse iterator into a map. After all, end() was inlined as well; in fact, every single function used in my tests was inlined.

As to unrolling the loops: I'd probably do something like this:

std::vector<uint32_t>::const_iterator current = v.begin();
std::vector<uint32_t>::const_iterator end = v.end();
switch ( (end - current) % 4 ) {
case 3:
    sum += *current ++;
case 2:
    sum += *current ++;
case 1:
    sum += *current ++;
case 0:
}
while ( current != end ) {
    sum += current[0] + current[1] + current[2] + current[3];
    current += 4;
}

(This is a factor of 4. You can easily increase it if necessary.)

like image 164
James Kanze Avatar answered Oct 21 '22 15:10

James Kanze