(Note: the following mentions execinfo/backtrace, but this is only an example. The behavior in the question appeared with various libraries.)
Consider a utility library that tracks the resource allocations of some application linked with it. As functions allocate and deallocate resources, they call a tracking function that records the details of the operation, as well as some information which can be used to reconstruct the call path. Occasionally, the library is queried for a breakdown of operations by call paths.
In the setting of this question, tracking is required to be low-overhead, but querying is not. Consequently, for tracking I stored the minimal information identifying the call path, e.g., by calling execinfo/backtrace. Translating the symbols, demangling, and so forth, are deferred to query time, and are not part of this question.
To my surprise, simply calling backtrace slowed down execution by ~4000%(!) relative to the tracked malloc call. Since backtrace takes the requested (maximal) stack depth as an argument, and it can be called from call paths with different stack depths, I tried to see how these parameters affect its performance. As far as I can see, simply calling this function at all incurs a huge penalty.
For the measurements I wrote the following simple code (see also the full version):
#include <execinfo.h>

const size_t max_backtrace_size = 100;
void *backtrace_buf[max_backtrace_size];
size_t backtrace_size; // set per experiment; 0 disables the backtrace call

static void track()
{
    if(backtrace_size > 0)
        ::backtrace(backtrace_buf, backtrace_size);
}

static void op(size_t i)
{
    if(i == 0)
    {
        track();
        return;
    }
    op(i - 1);
}
The first of these two functions, track, simulates the actual tracking (note that backtrace_size == 0 completely disables the call to backtrace); the second, op, is a recursion that terminates by calling track. Using these two functions, I varied the parameters and measured the results (see also the IPython Notebook).
The following figure shows the tracking time as a function of the stack depth, comparing a call to backtrace with backtrace_size == 1 against no call at all (the latter takes so little time that it lies on the X axis and can hardly be seen in the figure). backtrace has a huge overhead even when called with small parameters.

The following figure further shows the overhead, now as a function of both the backtrace size and the stack depth. Again, there is a huge jump from simply calling this function.

1. Is there any technical way to reduce the overhead of finding the backtrace? (Perhaps a different library, or different build settings.)
2. In the absence of 1., is there a completely different viable alternative to the problem at the top?
To my surprise, simply calling backtrace slowed down execution by ~4000%(!).
This statement, in itself, is meaningless: even if backtrace() boiled down to a single instruction, it would still constitute infinite overhead relative to other code containing no instructions at all.
A 40x overhead likely means that the resource you are trying to account for is exceedingly cheap to obtain. If so, perhaps not every instance of that resource must be accounted for? Could you for example record stack trace only for every Nth resource allocation?
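The sampling idea can be sketched as follows, amortizing one backtrace call over many allocations. The function name, the interval of 64, and the return-value convention are all assumptions made for illustration:

```cpp
#include <execinfo.h>
#include <cstddef>

const size_t sample_interval = 64; // record one trace per 64 allocations
const size_t max_backtrace_size = 100;

static size_t allocation_count = 0;
static void *backtrace_buf[max_backtrace_size];

// Returns the number of frames captured, or 0 if this call was skipped.
// Only every sample_interval-th call pays the cost of ::backtrace.
static int track_sampled()
{
    if(++allocation_count % sample_interval != 0)
        return 0;
    return ::backtrace(backtrace_buf, max_backtrace_size);
}
```

The resulting breakdown by call path is then a statistical estimate rather than an exact count, which is often acceptable for profiling-style queries.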
Is there any technical way to reduce the overhead of finding the backtrace? (perhaps a different library, or different build settings.)
There are a few possibilities to consider.
Assuming you are asking about Linux/x86_64, one reason that backtrace is slow is that in the absence of frame pointers it has to find and interpret unwind information. Another reason: it is mostly used for handling exceptions, and was never optimized for speed.
If you have full control of the application that will use your library, building everything with -fno-omit-frame-pointer would allow you to use a much faster frame-based unwinder.
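To illustrate why that is fast: with frame pointers present, unwinding is just a pointer chase. The following is a sketch for x86_64, not the glibc implementation; it assumes the whole program is built with -fno-omit-frame-pointer, so that every frame begins with the saved rbp followed by the return address:

```cpp
#include <cstddef>

// Walk the frame-pointer chain. frame[0] is the caller's saved frame
// pointer, frame[1] is the return address into the caller.
static size_t fp_backtrace(void **buf, size_t max_depth)
{
    void **frame = (void **)__builtin_frame_address(0);
    size_t depth = 0;
    while(frame && depth < max_depth)
    {
        void *ret = frame[1];
        if(!ret)
            break;
        buf[depth++] = ret;
        void **next = (void **)frame[0];
        if(next <= frame) // the stack grows down, so the chain must move up
            break;
        frame = next;
    }
    return depth;
}
```

A production version would need more validation (e.g., bounds-checking each frame pointer against the thread's stack), since a single frame compiled without frame pointers breaks the chain.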
If you can't do that, libunwind's backtrace could be significantly faster than GLIBC's (though it still can't beat a frame-based unwinder).