I want to improve the performance of a specific method inside a larger application.
The goal is to improve latency (wall-clock time spent in a specific function), not (necessarily) system load.
Requirements:
Tools discarded so far:
Other options which I haven't further evaluated yet:
I'd love to hear about:
I finally settled for:
The trace produced by this crude tool is hard to interpret, and I can easily imagine some tools for further processing its output making it infinitely more useful. However, this did the job for me for now, so I'm putting that project off until later ;).
For I/O-bound applications you can use the --collect-systime=yes option of callgrind.
This collects time spent in system calls (in milliseconds). So if you believe you have an I/O bottleneck, you can use these stats to identify it.
Use this method.
It is quite simple and effective at pinpointing opportunities for optimization, whether they are in CPU-bound or I/O-bound code.
If you are right that the biggest opportunities are in a particular function or module, then it will find them. If they are elsewhere, it will find them there too.
Of the tools you mentioned and discarded, it is most similar to poor man's profiler, but still not very similar.
EDIT: Since you say it is triggered by a user interaction and blocks further input until it completes, here's how I would do it.
First, I assume it does not block a manual interrupt signal to the debugger, because otherwise you'd have no way to stop an infinite loop. Second, I would wrap a loop of 10, 100, or 1000 times around the routine in question, so it is doing it long enough to be manually interrupted.
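The wrapping step can be sketched as follows. This is only an illustration in Python, not the original application's code: `routine_under_test` and `run_repeatedly` are hypothetical names, and the sleep is a stand-in for real work. In Python, pressing Ctrl-C while the loop runs raises KeyboardInterrupt, and the resulting traceback serves as the stack sample; in a native program you would break into the debugger instead.

```python
import time


def routine_under_test():
    """Hypothetical stand-in for the slow routine being investigated."""
    time.sleep(0.001)  # placeholder for the real work


def run_repeatedly(routine, repetitions=1000):
    """Call `routine` many times so a manual interrupt has time to land inside it.

    Returns the number of completed calls.
    """
    completed = 0
    for _ in range(repetitions):
        routine()
        completed += 1
    return completed


if __name__ == "__main__":
    # Interrupt this while it runs and read the call stack it stopped in.
    run_repeatedly(routine_under_test, repetitions=1000)
```

Pick the repetition count so the loop runs for at least several seconds; the exact number does not matter, since the loop is removed again once the tuning is done.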
Now, suppose it is spending some fraction of time doing I/O, like 50%. Then when you interrupt it, you have a 50% chance of catching it in the I/O. So if you catch it in the I/O, which the call stack will tell you, you can also see in great detail where the I/O is being requested from and why.
It will show you what's going on, which is almost certainly something surprising. If you see it doing something on as few as two (2) samples that you could find a way to eliminate, then you will get a considerable speedup. You don't know in advance exactly how much time eliminating that activity will save, but on average you can expect to save a fraction F = (s+1)/(n+2), where n is the total number of samples you took and s is the number of samples that showed the activity (the Rule of Succession). For example, if you took 4 stack samples and saw the activity on 2 of them, on average it would save you F = 3/6 = 1/2, corresponding to a speedup factor of 1/(1-F), or 2.
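The arithmetic above is easy to check with a few lines of Python. This is a minimal sketch; the function names `expected_fraction_saved` and `speedup_factor` are made up for illustration, and exact fractions are used so the result matches the hand calculation.

```python
from fractions import Fraction


def expected_fraction_saved(s, n):
    """Rule of Succession estimate F = (s + 1) / (n + 2).

    s: number of stack samples that showed the activity
    n: total number of stack samples taken
    """
    if not 0 <= s <= n:
        raise ValueError("need 0 <= s <= n")
    return Fraction(s + 1, n + 2)


def speedup_factor(fraction_saved):
    """Speedup 1 / (1 - F) obtained by removing a fraction F of the run time."""
    return 1 / (1 - fraction_saved)


f = expected_fraction_saved(s=2, n=4)
print(f, speedup_factor(f))  # prints "1/2 2"
```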
Once you've done that, you can do it again and find something else to fix. The speedup factors multiply together like compound interest.
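To see how the factors compound, here is a small Python sketch. The three per-round speedups are invented numbers for illustration only, and `combined_speedup` is a hypothetical helper name.

```python
import math


def combined_speedup(factors):
    """Overall speedup after several rounds of fixes: the product of each round's factor."""
    return math.prod(factors)


# Hypothetical example: three rounds that gave 2x, 1.5x and 1.25x respectively.
rounds = [2.0, 1.5, 1.25]
print(combined_speedup(rounds))  # prints 3.75
```

Each later fix looks proportionally bigger than it would have at the start, because the time already removed no longer dilutes it.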
Then of course you remove the outer loop and "cash in" all the speedups you got.
If you are wondering how this differs from profiling, it is that by carefully examining each stack sample, and possibly related data, you can recognize activities that you could remove, where if all you've got is measurements, you are left trying to intuit what is going on. The actual amount of time you save is what it is, regardless of any measurements. The important thing is to find the problem. No matter how precisely a profiler might measure it, if you can't find it, you're not winning. These pages are full of people saying either they don't understand what their profiler is telling them, or it seems to be saying there is nothing to fix, which they are only too willing to accept. That's a case of rose-tinted glasses.
More on all that.
Todo: check out 'perf' (again)