Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Difference in performance: std::accumulate vs std::inner_product vs Loop

Today, I want to share something that was blowing my mind when I tried to implement this simple operation:

enter image description here

I found different ways to perform the same operation:

  1. By using the std::inner_product.
  2. Implementing a predicate and using the std::accumulate function.
  3. Using a loop in C style.

I wanted to perform some benchmark by using Quick Bench and enabling all the optimizations.

First of all, I compared the two C++ alternatives with floating values. This is the code used by using std::accumulate:

const auto predicate = [](const double previous, const double current) {
    return previous + current * current;
};
const auto result = std::accumulate(input.cbegin(), input.cend(), 0, predicate);

Versus this code by using the std::inner_product functionality:

const auto result = std::inner_product(input.cbegin(), input.cend(), input.cbegin(), 1);

After running the benchmark with all the optimization enabled, I got this result:

enter image description here

Both algorithms seem to reach the same performance. I did want to go further and try the C implementation:

double result = 0;
for (auto i = 0; i < input.size(); ++i) {
  result += input[i] * input[i];
}

And surprisingly, I found:

enter image description here

I was not expecting this result. I was sure there is something wrong so I did check the GCC implementation:

template<typename _InputIterator1, typename _InputIterator2, typename _Tp>
inline _Tp
inner_product(_InputIterator1 __first1, _InputIterator1 __last1,
      _InputIterator2 __first2, _Tp __init)
{
  // concept requirements
  __glibcxx_function_requires(_InputIteratorConcept<_InputIterator1>)
  __glibcxx_function_requires(_InputIteratorConcept<_InputIterator2>)
  __glibcxx_requires_valid_range(__first1, __last1);

  for (; __first1 != __last1; ++__first1, (void)++__first2)
__init = __init + (*__first1 * *__first2);
  return __init;
}

I found that It was doing the same as the C implementation. After reviewing the implementation, I discovered something weird, (or at least I was not expecting to have that significant impact): in all the internal accumulations, it was doing a cast from the iterator value_type to the type of the initial value.

In my case, I was initializing the initial values to 0 or 1, the values were considered integers and in each accumulation, the compiler was doing the casting. In the different test cases, my input array stores truncated floating points, so the result did not change.

After updating the initial value to a double type:

const auto result = std::accumulate(input.cbegin(), input.cend(), 0.0, predicate);

And:

const auto result = std::inner_product(input.cbegin(), input.cend(), input.cbegin(), 0.0);

I got the expected result:

enter image description here

Now, I understand that leaving the initial value to be an independent type from the underlying type of the iterator may make the function more flexible and allow to do more things. But,

If I am accumulating elements of an array, I am expecting to get the same type as a result. Same for the inner product.

Should it be the default behaviour?

Why did the standard decide to perform it in this way?

like image 670
mohabouje Avatar asked Sep 04 '18 13:09

mohabouje


People also ask

Is accumulate faster than for loop?

As for the Range For Loop, the address is incremented by 16 (= 8 + 8 because of loop unrolling), multiplication is not used to calculate the address. Accumulator code use the same tactics. Earlier in the decade, C programmers were baffled as to why std::accumulate was faster than for loop. Now we know the reason.

Is std :: accumulate faster?

As I expect, the manually unrolled loop is the fastest one, but more interesting is that std::accumulate is much slower than simple loop.


1 Answers

If I am accumulating elements of an array, I am expecting to get the same type as a result.

Your expectation is wrong (though it is not quite clear what "same type as result" means), as you can clearly see from std::accumulate documentation:

template< class InputIt, class T >
T accumulate( InputIt first, InputIt last, T init );

template< class InputIt, class T, class BinaryOperation >
T accumulate( InputIt first, InputIt last, T init,
              BinaryOperation op );

return type is exactly the same type you use for initial value. The same effect you can have on the loop:

auto result = 0; // vs auto result = 0.0;
for (auto i = 0; i < input.size(); ++i) {
  result += input[i] * input[i];
}

Why did the standard decide to perform it in this way?

Because this way you can decide what type you use to aggregate. Note std::accumulate can be used for left fold and cases when T not equal to std::iterator_traits<InputIt>::value_type not less often (probably even more) than when they match.

like image 132
Slava Avatar answered Sep 22 '22 14:09

Slava