When I'm benchmarking some code, either on a fixed set of input data, or on random input where the randomness doesn't affect the control flow: what is the best metric to use to assess the performance of my code?
I've always used minimum runtime over multiple runs because any deviation from the minimum will be due to the CPU being busy with unrelated things, but I couldn't find any reliable sources confirming that that's the best practice. Other obvious choices are average or median run time. (Maximum seems odd, as it will probably be dominated by unrelated CPU spikes.) Are there any better ways to make sense of the statistical data gathered from several runs?
As paxdiablo points out, if I can measure CPU time directly that would be ideal. But what do I do when I can only benchmark wall time?
As I said, I was unable to find anything reliable on this, but maybe I just didn't use the right Google keywords, so if you can point me to anything that already exists, that would be a great help. Also, please feel free to migrate this to Programmers.SE if this question is too general for SO.
If you're benchmarking CPU time, some systems will report CPU usage independently of elapsed (wall) time.
You're right that wall time can vary depending on what else the system is doing, but that generally doesn't affect CPU time.
By way of example, the time utility in Linux (and other UNIX-like operating systems) reports as follows:
    pax> time sleep 1

    real    0m1.001s
    user    0m0.000s
    sys     0m0.000s
The real time is wall time, a touch over a second. The user and sys time are time spent using the CPU, which is minimal in this case because the process is waiting for the sleep to finish (an action that takes practically no CPU time).
If you have this facility available, that's the one you should use.
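If you're timing from inside the program rather than with the time utility, most languages expose both clocks directly. Here's a minimal sketch in Python, with benchmark_me as a hypothetical stand-in for whatever you actually want to measure:

    import time

    def benchmark_me():
        # Stand-in for the code you actually want to benchmark.
        return sum(i * i for i in range(100_000))

    wall_start = time.perf_counter()   # wall-clock time
    cpu_start = time.process_time()    # CPU time used by this process only

    benchmark_me()

    wall_elapsed = time.perf_counter() - wall_start
    cpu_elapsed = time.process_time() - cpu_start

    print(f"wall: {wall_elapsed:.6f}s  cpu: {cpu_elapsed:.6f}s")

Like user + sys in the time output, process_time() excludes time the process spends off the CPU, so unrelated background load shouldn't inflate it much.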
If you don't have a facility like that, then you'll probably have to fall back on statistical methods: minimise CPU usage by other processes and run your own process hundreds of times to build up a decent picture.
Whether you take the average or minimum (or something weird like the average after removing outliers) will depend on whatever school of statistics you follow. You should opt for the minimum if, as you say, you're sure any variation is not due to the workload itself.
And ensuring other loads are minimised is important. If you have a rogue process taking up 97% of the CPU grunt, the minimum will be vastly skewed upwards in comparison to a mostly-idle system (that's why CPU time is so much better than wall time).
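To illustrate that statistical approach, here's a rough sketch in Python (the run count is an arbitrary choice, and benchmark_me again stands in for your own code):

    import time

    def benchmark_me():
        # Stand-in for the code under test.
        return sum(i * i for i in range(100_000))

    RUNS = 200  # arbitrary; use enough runs that the minimum stabilises

    samples = []
    for _ in range(RUNS):
        start = time.perf_counter()
        benchmark_me()
        samples.append(time.perf_counter() - start)

    print(f"min:  {min(samples):.6f}s")
    print(f"mean: {sum(samples) / len(samples):.6f}s")

On a busy machine the mean drifts upwards with the background load, while the minimum stays comparatively stable.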
Chen, J. and Revels, J., 2016. Robust benchmarking in noisy environments. arXiv preprint arXiv:1608.04295 has some fairly extensive and mathematically justified arguments about how to do robust benchmarking. The short version: use the minimum.
That matches your expectation, because it boils down to this:
noise can only ever make things run slower, never faster.
As such, the minimum, not the mean or median, is the best estimator of the true time.
The model is measured_time = true_time + noise, where the noise is nonnegative, so the sample with the smallest measured time must also have the smallest error (true_time doesn't change between samples; only the noise does).
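A toy simulation makes that concrete; the numbers below are made up and only the true_time + nonnegative-noise structure matters:

    import random

    TRUE_TIME = 1.0  # hypothetical true runtime in seconds

    # Nonnegative noise: a measurement can only ever be slower than the true time.
    samples = [TRUE_TIME + random.expovariate(10) for _ in range(1000)]

    mean = sum(samples) / len(samples)
    print(f"true: {TRUE_TIME:.4f}  min: {min(samples):.4f}  mean: {mean:.4f}")
    # The minimum lands closest to TRUE_TIME; the mean is biased upwards by the noise.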