Today, I want to share something that was blowing my mind when I tried to implement this simple operation:
I found different ways to perform the same operation:
std::inner_product
.std::accumulate
function.I wanted to perform some benchmark by using Quick Bench and enabling all the optimizations.
First of all, I compared the two C++ alternatives with floating values. This is the code used by using std::accumulate
:
const auto predicate = [](const double previous, const double current) {
return previous + current * current;
};
const auto result = std::accumulate(input.cbegin(), input.cend(), 0, predicate);
Versus this code by using the std::inner_product
functionality:
const auto result = std::inner_product(input.cbegin(), input.cend(), input.cbegin(), 1);
After running the benchmark with all the optimization enabled, I got this result:
Both algorithms seem to reach the same performance. I did want to go further and try the C implementation:
double result = 0;
for (auto i = 0; i < input.size(); ++i) {
result += input[i] * input[i];
}
And surprisingly, I found:
I was not expecting this result. I was sure there is something wrong so I did check the GCC implementation:
template<typename _InputIterator1, typename _InputIterator2, typename _Tp>
inline _Tp
inner_product(_InputIterator1 __first1, _InputIterator1 __last1,
_InputIterator2 __first2, _Tp __init)
{
// concept requirements
__glibcxx_function_requires(_InputIteratorConcept<_InputIterator1>)
__glibcxx_function_requires(_InputIteratorConcept<_InputIterator2>)
__glibcxx_requires_valid_range(__first1, __last1);
for (; __first1 != __last1; ++__first1, (void)++__first2)
__init = __init + (*__first1 * *__first2);
return __init;
}
I found that It was doing the same as the C implementation. After reviewing the implementation, I discovered something weird, (or at least I was not expecting to have that significant impact): in all the internal accumulations, it was doing a cast from the iterator value_type to the type of the initial value.
In my case, I was initializing the initial values to 0 or 1, the values were considered integers and in each accumulation, the compiler was doing the casting. In the different test cases, my input array stores truncated floating points, so the result did not change.
After updating the initial value to a double type:
const auto result = std::accumulate(input.cbegin(), input.cend(), 0.0, predicate);
And:
const auto result = std::inner_product(input.cbegin(), input.cend(), input.cbegin(), 0.0);
I got the expected result:
Now, I understand that leaving the initial value to be an independent type from the underlying type of the iterator may make the function more flexible and allow to do more things. But,
If I am accumulating elements of an array, I am expecting to get the same type as a result. Same for the inner product.
Should it be the default behaviour?
Why did the standard decide to perform it in this way?
As for the Range For Loop, the address is incremented by 16 (= 8 + 8 because of loop unrolling), multiplication is not used to calculate the address. Accumulator code use the same tactics. Earlier in the decade, C programmers were baffled as to why std::accumulate was faster than for loop. Now we know the reason.
As I expect, the manually unrolled loop is the fastest one, but more interesting is that std::accumulate is much slower than simple loop.
If I am accumulating elements of an array, I am expecting to get the same type as a result.
Your expectation is wrong (though it is not quite clear what "same type as result" means), as you can clearly see from std::accumulate documentation:
template< class InputIt, class T >
T accumulate( InputIt first, InputIt last, T init );
template< class InputIt, class T, class BinaryOperation >
T accumulate( InputIt first, InputIt last, T init,
BinaryOperation op );
return type is exactly the same type you use for initial value. The same effect you can have on the loop:
auto result = 0; // vs auto result = 0.0;
for (auto i = 0; i < input.size(); ++i) {
result += input[i] * input[i];
}
Why did the standard decide to perform it in this way?
Because this way you can decide what type you use to aggregate. Note std::accumulate
can be used for left fold and cases when T
not equal to std::iterator_traits<InputIt>::value_type
not less often (probably even more) than when they match.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With