Modeling distribution of performance measurements

How would you mathematically model the distribution of repeated real-life performance measurements? "Real-life" here means you are not just looping over the code in question; it is a short snippet within a large application running in a typical user scenario.

In my experience, there is usually a peak around the average execution time that can be modeled adequately with a Gaussian distribution. In addition, there is a "long tail" of outliers, often at a multiple of the average time. (This behavior is understandable considering the factors that contribute to a first-execution penalty.)

My goal is to model aggregate values that reasonably reflect this and that can themselves be calculated from running aggregates (as with the Gaussian, where mu and sigma follow from N, the sum of values and the sum of squares). In other words, the number of repetitions is unlimited, but memory and calculation requirements should be minimized.
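A minimal sketch of the Gaussian case mentioned above, assuming the timings arrive one at a time. The class name `RunningMoments` and its methods are made up for illustration; only N, the sum and the sum of squares are stored.

```python
import math


class RunningMoments:
    """Constant-memory accumulator (hypothetical name): keeps N, sum(x), sum(x^2)."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, x):
        # One measurement in, three numbers updated; nothing else is stored.
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def mean(self):
        return self.total / self.n

    def stddev(self):
        # Population standard deviation from the aggregates.
        # Clamp at zero because floating-point cancellation can make
        # the difference slightly negative.
        var = self.total_sq / self.n - self.mean() ** 2
        return math.sqrt(max(var, 0.0))


acc = RunningMoments()
for t in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    acc.add(t)
mu, sigma = acc.mean(), acc.stddev()
```

For these eight sample values, `mu` is 5 and `sigma` is 2. With very many samples or a large mean relative to the spread, the sum-of-squares formula loses precision; an incremental update such as Welford's method is a common alternative.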

A plain Gaussian distribution can't model the long tail appropriately, and its average is biased strongly by even a very small percentage of outliers.

I am looking for ideas, especially if this has been attempted or analysed before. I've checked various distribution models, and I think I could work something out, but my statistics are rusty and I might end up with an overblown solution. Oh, a complete shrink-wrapped solution would be fine, too ;)

Other aspects / ideas: Sometimes you get "two-hump" distributions. In my scenario it would be acceptable to cover both humps with a single mu/sigma, but ideally they would be identified separately.

Extrapolating from this, another approach would be a "floating probability density calculation" that uses only a limited buffer and adjusts automatically to the range (because of the long tail, the bins may not be evenly spaced). I haven't found anything like this, but with some assumptions about the distribution it should be possible in principle.
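One possible reading of this idea, sketched under the assumption that all timings are strictly positive: use logarithmically spaced bins, so the bin width grows with the value and the number of bins grows only with the logarithm of the observed range. `LogHistogram` and `bins_per_decade` are hypothetical names, not an existing library.

```python
import math
from collections import Counter


class LogHistogram:
    """Histogram with log-spaced bins (illustrative sketch, assumes x > 0)."""

    def __init__(self, bins_per_decade=20):
        self.bins_per_decade = bins_per_decade
        self.counts = Counter()  # bin index -> number of samples
        self.n = 0

    def add(self, x):
        # Bins are created on demand, so the range adapts automatically.
        idx = math.floor(math.log10(x) * self.bins_per_decade)
        self.counts[idx] += 1
        self.n += 1

    def quantile(self, q):
        """Upper edge of the bin that contains the q-quantile (0 <= q <= 1)."""
        target = q * self.n
        seen = 0
        for idx in sorted(self.counts):
            seen += self.counts[idx]
            if seen >= target:
                return 10 ** ((idx + 1) / self.bins_per_decade)
        raise ValueError("histogram is empty")


hist = LogHistogram()
for ms in range(1, 1001):  # synthetic timings of 1..1000 ms
    hist.add(float(ms))
median_estimate = hist.quantile(0.5)
```

With 20 bins per decade each bin spans about 12% of its lower edge, so quantile estimates carry a relative error of that order, and three decades of range need at most 61 counters.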

Why (since it was asked):

For a complex process we need to make guarantees such as "only 0.1% of runs exceed a limit of 3 seconds, and the average processing time is 2.8 seconds". The performance of an isolated piece of code can be very different from a normal run-time environment involving varying levels of disk and network access, background services, scheduled events that occur within a day, etc.

This can be solved trivially by accumulating all the data. However, to collect it in production, the amount of data produced needs to be limited. For analysis of isolated pieces of code, a Gaussian distribution plus a first-run penalty is OK. That no longer works for the distributions described above.

[edit] I've already got very good answers (and finally - maybe - some time to work on this). I'm starting a bounty to look for more input / ideas.

864

asked Dec 08 '09 14:12

peterchen

1 Answer

Often when you have a random value that can only be positive, a log-normal distribution is a good way to model it. That is, you take the log of each measurement, and assume that is normally distributed.
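As a rough sketch of how this could be kept in constant memory: accumulate N, the sum of the logs and the sum of the squared logs, then read mu and sigma of the log values off those aggregates. The class `LogNormalFit` and its method names are invented for this example, and the tail estimate is only as good as the log-normal assumption itself.

```python
import math


class LogNormalFit:
    """Streaming log-normal fit (hypothetical helper, assumes x > 0)."""

    def __init__(self):
        self.n = 0
        self.sum_log = 0.0
        self.sum_log_sq = 0.0

    def add(self, x):
        lx = math.log(x)
        self.n += 1
        self.sum_log += lx
        self.sum_log_sq += lx * lx

    def mu(self):
        # Mean of the log values.
        return self.sum_log / self.n

    def sigma(self):
        # Population standard deviation of the log values.
        var = self.sum_log_sq / self.n - self.mu() ** 2
        return math.sqrt(max(var, 0.0))

    def median(self):
        # exp(mu) is the geometric mean of the samples and the
        # median of the fitted log-normal distribution.
        return math.exp(self.mu())

    def prob_exceeds(self, limit):
        # Upper-tail probability P(X > limit) under the fitted model.
        z = (math.log(limit) - self.mu()) / self.sigma()
        return 0.5 * math.erfc(z / math.sqrt(2.0))


fit = LogNormalFit()
for t in [1.0, math.e, math.e ** 2]:
    fit.add(t)
tail_at_median = fit.prob_exceeds(fit.median())
```

A statement like "only 0.1% of runs exceed 3 seconds" could then be checked as `fit.prob_exceeds(3.0) <= 0.001`, provided the measured tail really follows the fitted curve.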

If you want, you can allow for multiple humps, i.e. treat the data as a mixture of two normals with different means. The parameters of such a mixture are a bit tricky to estimate, because you may have to estimate, for each measurement, its probability of belonging to each hump. That may be more than you want to bother with.
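One standard way to do that per-measurement estimate is expectation-maximization (EM). The sketch below is a plain batch version for two humps; it assumes all (log) measurements are held in memory, and the function name `fit_two_humps`, the starting values and the iteration count are arbitrary choices for illustration.

```python
import math
import random


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def fit_two_humps(values, iterations=50):
    """EM for a two-component 1-D normal mixture; returns (weights, means, sds)."""
    lo, hi = min(values), max(values)
    # Crude initial guess: one hump in each half of the observed range.
    means = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    spread = (hi - lo) / 4 or 1.0
    sds = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: probability that each value belongs to each hump.
        resp = []
        for x in values:
            p = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in (0, 1)]
            total = sum(p) or 1e-300
            resp.append((p[0] / total, p[1] / total))
        # M-step: re-estimate each hump from its weighted members.
        for k in (0, 1):
            rk = sum(r[k] for r in resp)
            if rk < 1e-12:
                continue  # hump has no members; keep its old parameters
            weights[k] = rk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / rk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / rk
            sds[k] = max(math.sqrt(var), 1e-6)
    return weights, means, sds


rng = random.Random(1)
# Synthetic log-timings: two well-separated humps of equal size.
data = [rng.gauss(0.0, 0.1) for _ in range(500)] + [rng.gauss(2.0, 0.1) for _ in range(500)]
weights, means, sds = fit_two_humps(data)
```

EM can settle on a poor local optimum when the humps overlap heavily or the starting values are bad, so the result is worth checking against a histogram.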

Log-normal distributions are very convenient and well-behaved. For example, you don't work with the average; you work with the geometric mean, which is the same as the median.

BTW, in pharmacometric modeling, log-normal distributions are ubiquitous, modeling such things as blood volume, absorption and elimination rates, body mass, etc.

ADDED: If you want what you call a floating distribution, that's called an empirical or non-parametric distribution. To model that, typically you save the measurements in a sorted array. Then it's easy to pick off the percentiles. For example the median is the "middle number". If you have too many measurements to save, you can go to some kind of binning after you have enough measurements to get the general shape.
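The sorted-array approach might look like the following sketch. `EmpiricalDistribution` is a made-up name, the percentile uses the nearest-rank definition (one of several common conventions), and the timings are invented sample values.

```python
import bisect
import math


class EmpiricalDistribution:
    """Keeps every measurement in a sorted list (illustrative sketch)."""

    def __init__(self):
        self.sorted_values = []

    def add(self, x):
        # Insert while keeping the list sorted.
        bisect.insort(self.sorted_values, x)

    def percentile(self, p):
        """Nearest-rank percentile for p in (0, 100]."""
        if not self.sorted_values:
            raise ValueError("no measurements yet")
        rank = max(1, math.ceil(p / 100.0 * len(self.sorted_values)))
        return self.sorted_values[rank - 1]

    def fraction_above(self, limit):
        # Share of measurements strictly greater than the limit.
        idx = bisect.bisect_right(self.sorted_values, limit)
        return (len(self.sorted_values) - idx) / len(self.sorted_values)


dist = EmpiricalDistribution()
for seconds in [2.9, 2.7, 3.4, 2.8, 2.6]:
    dist.add(seconds)
median = dist.percentile(50)
over_limit = dist.fraction_above(3.0)
```

Here `median` is 2.8 and `over_limit` is 0.2, since one of the five runs exceeds 3 seconds. Memory grows with the number of measurements, which is why binning becomes attractive for long runs.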

ADDED: There's an easy way to tell if a distribution is normal (or log-normal). To test for log-normality, take the logs of the measurements and put them in a sorted array (for plain normality, use the raw measurements). Then generate a QQ (quantile-quantile) plot: generate as many normal random numbers as you have samples, sort them, and plot the points, where X is the normal distribution point and Y is the log-sample point. If the data are log-normal, the points should fall close to a straight line. (A really simple way to generate an approximately normal random number is to add together 12 uniform random numbers in the range +/- 0.5.)
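A sketch of that recipe without any plotting library: it builds the QQ points and, as an extra not described above, uses the correlation coefficient of the points as a rough number for how straight the line is. The function names are hypothetical, and the test data are synthetic log-normal samples.

```python
import math
import random


def approx_normal(rng):
    # Sum of 12 uniforms in [-0.5, 0.5]: mean 0, variance 1, roughly normal.
    return sum(rng.uniform(-0.5, 0.5) for _ in range(12))


def qq_points(samples, rng):
    """Pair sorted normal draws (X) with sorted log-samples (Y)."""
    logs = sorted(math.log(s) for s in samples)
    normals = sorted(approx_normal(rng) for _ in logs)
    return list(zip(normals, logs))


def straightness(points):
    """Pearson correlation of the QQ points; values near 1 mean a straight line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(42)
# Synthetic timings drawn from a log-normal distribution.
timings = [rng.lognormvariate(1.0, 0.5) for _ in range(2000)]
points = qq_points(timings, rng)
r = straightness(points)
```

The list `points` can be handed to any plotting tool. A long tail that is heavier than log-normal would show up as the right-hand end of the plot bending upwards.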

185

answered Nov 06 '22 15:11

Mike Dunlavey

Tags: performance, algorithm, math, statistics
