I'm working on a complex framework which uses std::function<>
as argument of many functions. By profiling i found one of the performance problem the following.
Can somebody explain me why the Loop3a is so slow? I expected that the inlining will be used and the time will be same. The same for the assembly. Is there any way to improve performance or different way? Does the C++17 makes any change in that way?
#include <iostream>
#include <functional>
#include <chrono>
#include <cmath>
static const unsigned N = 300;
struct Loop3a
{
void impl()
{
sum = 0.0;
for (unsigned i = 1; i <= N; ++i) {
for (unsigned j = 1; j <= N; ++j) {
for (unsigned k = 1; k <= N; ++k) {
sum += fn(i, j, k);
}
}
}
}
std::function<double(double, double, double)> fn = [](double a, double b, double c) {
const auto subFn = [](double x, double y) { return x / (y+1); };
return sin(a) + log(subFn(b, c));
};
double sum;
};
struct Loop3b
{
void impl()
{
sum = 0.0;
for (unsigned i = 1; i <= N; ++i) {
for (unsigned j = 1; j <= N; ++j) {
for (unsigned k = 1; k <= N; ++k) {
sum += sin((double)i) + log((double)j / (k+1));
}
}
}
}
double sum;
};
int main()
{
using Clock = std::chrono::high_resolution_clock;
using TimePoint = std::chrono::time_point<Clock>;
TimePoint start, stop;
Loop3a a;
Loop3b b;
start = Clock::now();
a.impl();
stop = Clock::now();
std::cout << "A: " << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
std::cout << "ms\n";
start = Clock::now();
b.impl();
stop = Clock::now();
std::cout << "B: " << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
std::cout << "ms\n";
return a.sum == b.sum;
}
Sample output using g++5.4 with "-O2 -std=c++14":
A: 1794ms
B: 906ms
In the profiler i can see many of this internals:
double&& std::forward<double>(std::remove_reference<double>::type&)
std::_Function_handler<double (double, double, double), Loop3a::fn::{lambda(double, double, double)#1}>::_M_invoke(std::_Any_data const&, double, double, double)
Loop3a::fn::{lambda(double, double, double)#1}* const& std::_Any_data::_M_access<Loop3a::fn::{lambda(double, double, double)#1}*>() const
We should not use functions that are I/O bound as inline functions. When large code is used in some function, then we should avoid the inline. When recursion is used, inline function may not work properly. One point we have to keep in mind, that inline is not a command. It is a request. So we are requesting the compiler to use inline functions.
A compiler may inline (i.e. replace a call to the function with code that performs that action of that function) any function call that it chooses.
The preceding example shows that you don't have to declare these functions explicitly with the inline specifier. Using inline in the function definition causes it to be an inline function.
Inline Functions in C++ 1 If a function contains a loop. (for, while, do-while) 2 If a function contains static variables. 3 If a function is recursive. 4 If a function return type is other than void, and the return statement doesn’t exist in function body. 5 If a function contains switch or goto statement.
std::function
is not a zero-runtime-cost abstraction. It is a type-erased wrapper that has a virtual
-call like cost when invoking operator()
and could also potentially heap-allocate (which could mean a cache-miss per call).
The compiler will most likely not be able to inline it.
If you want to store your function objects in such a way that does not introduce additional overhead and that allows the compiler to inline it, you should use a template parameters. This is not always possible, but might fit your use case.
I wrote an article that's related to the subject:
"Passing functions to functions"
It contains some benchmarks that show how much assembly is generated for std::function
compared to a template parameter and other solutions.
std::function
has roughly a virtual call overhead. This is small, but if your operation is even smaller it can be large.
In your case, you are looping heavily over the std::function
, calling it with a set of predictible values, and probably doing next to nothing within it.
We can fix this.
template<class F>
std::function<double(double, double, double, unsigned)>
repeated_sum( F&& f ) {
return
[f=std::forward<F>(f)]
(double a, double b, double c, unsigned count)
{
double sum = 0.0;
for (unsigned i = 0; i < count; ++i)
sum += f(a,b,c+i);
return sum;
};
}
then
std::function<double(double, double, double, unsigned)> fn =
repeated_sum
(
[](double a, double b, double c) {
const auto subFn = [](double x, double y) { return x / (y+1); };
return sin(a) + log(subFn(b, c));
}
);
now repeating_function
takes a double, double, double
function and returns a double, double, double, unsigned
. This new function calls the previous one repeatedly, each time with the last coordinate increased by 1.
We then replace impl
as follows:
void impl()
{
sum = 0.0;
for (unsigned i = 1; i <= N; ++i) {
for (unsigned j = 1; j <= N; ++j) {
fn(i,j,0,N);
}
}
}
where we replace the "lowest level loop" with a single call to our repeating function.
This will reduce the virtual call overhead by a factor of 300, which basically makes it disappear. Basically, 50% of the time/300 = 0.15% of the time (actually 0.3%, as we reduce the time by a factor of 2 which doubles the contribution, but who is counting tenths of a percent?)
Now in the real situation you may not be calling it with 300 adjacent values. But usually there is some pattern.
What we did above was move some of the logic controlling how fn
was called inside fn
. If you can do this enough, you can remove the virtual call overhead from consideration.
std::function
overhead is mostly ignorable unless you want to call it on the order of billions of times per second, what I call "per-pixel" operations. Replace such operations with "per-scanline" -- per line of adjacent pixels -- and the overhead stops being a concern.
This can require exposing some of the logic on how the function object is used "in a header". Careful choice of what logic you expose can make it relatively generic in my experience.
Finally, note that it is possible to inline std::function
and compilers are getting better at it. But it is hard, and fragile. Relying on it at this point is not wise.
There is another approach.
template<class F>
struct looper_t {
F fn;
double operator()( unsigned a, unsigned b, unsigned c ) const {
double sum = 0;
for (unsigned i = 0; i < a; ++i)
for (unsigned j = 0; j < b; ++j)
for (unsigned k = 0; k < c; ++k)
sum += fn(i,j,k);
return sum;
}
};
template<class F>
looper_t<F> looper( F f ) {
return {std::move(f)};
}
now we write our looper:
struct Loop3c {
std::function<double(unsigned, unsigned, unsigned)> fn = looper(
[](double a, double b, double c) {
const auto subFn = [](double x, double y) { return x / (y+1); };
return sin(a) + log(subFn(b, c));
}
);
double sum = 0;
void impl() {
sum=fn(N,N,N);
}
};
which erases the entire operation of 3 dimensional looping, instead of just the trailing dimension.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With