In order to increase the performance of our applications, we have to consider loop optimisation techniques during the development phase.
I'd like to show you some different ways to iterate over a simple std::vector<uint32_t> v
:
Unoptimized loop with index:
uint64_t sum = 0;
for (unsigned int i = 0; i < v.size(); i++)
sum += v[i];
Unoptimized loop with iterator:
uint64_t sum = 0;
std::vector<uint32_t>::const_iterator it;
for (it = v.begin(); it != v.end(); it++)
sum += *it;
Cached std::vector::end
iterator:
uint64_t sum = 0;
std::vector<uint32_t>::const_iterator it, end(v.end());
for (it = v.begin(); it != end; it++)
sum += *it;
Pre-increment iterators:
uint64_t sum = 0;
std::vector<uint32_t>::const_iterator it, end(v.end());
for (it = v.begin(); it != end; ++it)
sum += *it;
Range-based loop:
uint64_t sum = 0;
for (auto const &x : v)
sum += x;
There are also other means to build a loop in C++; for instance by using std::for_each
, BOOST_FOREACH
, etc...
In your opinion, which is the best approach to increase the performance and why?
Furthermore, in performance-critical applications it could be useful to unroll the loops: again, which approach would you suggest?
There's no hard and fast rule, since it depends on the implementation. If the measures I did some years back are typical, however: about the only thing which makes a difference is caching the end iterator. Pre- or post-fix makes no difference, regardless of the container and iterator type.
At the time, I didn't measure indexing (because I was comparing
iterators of different types of container as well, and not all
support indexing). But I would guess that if you use indexes,
you should cache the results of v.size()
as well.
Of course, these measures were for one compiler (g++) on one system, with a specific hardware. The only way you can know for your environment is to measure yourself.
RE your note: are you sure you have full optimization turned on. My measures showed no difference between 3 and 4, and I doubt that commpilers optimize less today.
It's very important for the optimizations here that the functions are actually inlined. If they're not, post-incrementation does require some extra copying, and typically will require an extra function call (to the copy constructor of the iterator) as well. Once the functions are inlined, however, the compiler can easily see that all this is a unessential, and (at least when I tried it) generate exactly the same code in both cases. (I'd use pre-incrementation anyway. Not because it makes a difference, but because if you don't, some idiots will come along claiming it will, despite your measures. Or maybe they're not idiots, but are just using a particularly stupid compiler.)
To tell the truth, when I did the measurements, I was surprised
that caching the end iterator made a difference, even for
vector, where as there was no difference between pre- and
post-incrementation, even for a reverse iterator into a map.
After all, end()
was inlined as well; in fact, every single
function used in my tests was inlined.
As to unrolling the loops: I'd probably do something like this:
std::vector<uint32_t>::const_iterator current = v.begin();
std::vector<uint32_t>::const_iterator end = v.end();
switch ( (end - current) % 4 ) {
case 3:
sum += *current ++;
case 2:
sum += *current ++;
case 1:
sum += *current ++;
case 0:
}
while ( current != end ) {
sum += current[0] + current[1] + current[2] + current[3];
current += 4;
}
(This is a factor of 4. You can easily increase it if necessary.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With