Mean value and standard deviation of a very large data set

I am wondering if there is an algorithm that calculates the mean value and standard deviation of an unbounded data set.

For example, I am monitoring a measured value, say, electric current. I would like to have the mean of all historical data. Whenever a new value comes in, the mean and stdev should be updated. Because the data set is too big to store, I hope it can update the mean and stdev on the fly without storing the data.

Even if the data were stored, the standard way, (d1 + ... + dn)/n, doesn't work: the sum would overflow the data representation.

I thought about summing d1/n + d2/n + ... + dn/n, but if n is huge the error is too big and accumulates. Besides, n is unbounded in this case.

The number of data points is definitely unbounded; whenever a new one arrives, the values need to be updated.

Does anybody know if there is an algorithm for it?

Alfred Zhong, asked Apr 28 '12



2 Answers

[did the question change? maybe i only read the start. i have updated/edited to give a better reply:]

there is no perfect solution (in constant memory) that i know of, but i can give various approaches.

first, for the basic calculation you only need the sum of all values (sum_x), the sum of their squares (sum_x2), and the total count (n). then:

mean  = sum_x / n
stdev = sqrt( sum_x2/n - mean^2 )

and all these values (sum_x, sum_x2, n) can be updated from a stream.
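
a minimal C++ sketch of those three running values (the struct and member names are just illustrative, and this form has exactly the precision problems discussed below):

#include <cmath>
#include <cstdint>

// minimal sketch: keep only sum_x, sum_x2 and n, and update them per value.
struct StreamingStats {
    double   sum_x  = 0.0;   // sum of all values seen so far
    double   sum_x2 = 0.0;   // sum of their squares
    uint64_t n      = 0;     // count of values

    void add(double x) {
        sum_x  += x;
        sum_x2 += x * x;
        ++n;
    }

    double mean() const { return sum_x / n; }

    double stdev() const {
        const double m = mean();
        return std::sqrt(sum_x2 / n - m * m);   // population stdev
    }
};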

the problem (as you say) is dealing with overflow and / or limited precision. you can see this with floating point when sum_x2 becomes so large that its internal representation can no longer resolve values of the magnitude of a single squared value.

a simple way to avoid the problem is to use exact arithmetic, but that will be increasingly slow (and also uses O(log(n)) memory).

another way that can help is to normalise values - if you know that values are typically X then you can do calculations on x - X which makes the sums smaller (obviously you then add X to the mean!). that helps postpone the point at which precision is lost (and can/should be combined with any other method here - when binning, for example, you can use the mean of the previous bin). see this algorithm (knuth's method) for how to do this progressively.
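
a rough sketch of that progressive update (knuth's method, often written in welford's form; the names here are illustrative): track the running mean directly plus a sum of squared deviations from it, rather than raw squares.

#include <cmath>
#include <cstdint>

struct WelfordStats {
    uint64_t n    = 0;
    double   mean = 0.0;
    double   m2   = 0.0;   // sum of squared differences from the current mean

    void add(double x) {
        ++n;
        const double delta = x - mean;
        mean += delta / n;
        m2   += delta * (x - mean);   // second factor uses the updated mean
    }

    double variance() const { return m2 / n; }               // population variance
    double stdev()    const { return std::sqrt(variance()); }
};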

if you don't mind a (small constant factor) O(n) memory cost you could restart every N values (eg million - smarter still would be to adapt this value by detecting when precision is too low), storing previous mean and stdev and then combine for the final result (so your mean is the appropriately weighted value from the recent running total and the old binned values).
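
a sketch of that combining step (names are illustrative; storing the m2 term from the previous sketch instead of the stdev makes the merge exact):

#include <cstdint>

struct Summary {
    uint64_t n    = 0;
    double   mean = 0.0;
    double   m2   = 0.0;   // sum of squared deviations, as above
};

// merge two stored summaries into one
Summary combine(const Summary& a, const Summary& b) {
    Summary out;
    out.n = a.n + b.n;
    if (out.n == 0) return out;
    const double delta = b.mean - a.mean;
    out.mean = a.mean + delta * static_cast<double>(b.n) / out.n;
    out.m2   = a.m2 + b.m2
             + delta * delta * static_cast<double>(a.n) * static_cast<double>(b.n) / out.n;
    return out;
}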

the binning approach could probably be generalised to multiple levels (you start binning the bins) and would reduce to O(log(n)) memory use, but i haven't worked out the details.

finally, a more practical solution would probably be to do the initial approach for, say, 1000 values, and then start a new sum in parallel. you could display a weighted average of the two and, after another 1000 values, drop the old sums (after gradually decreasing their weight) and start a new set. so you always have two sets of sums, and display a weighted average between them, which gives continuous data that reflect the last 1000 values (only). in some cases that will be good enough, i imagine (it's not an exact value, since it's only for "recent" data, but it's smooth and representative, and uses a fixed amount of memory).
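
a rough sketch of that two-block cross-fade, for the mean only (block size, names and the exact weighting are illustrative choices; the same pattern applies to the sums of squares):

#include <cstdint>

struct Block {
    double   sum = 0.0;
    uint64_t n   = 0;
    void   add(double x) { sum += x; ++n; }
    double mean() const  { return sum / n; }
};

struct RollingMean {
    static constexpr uint64_t N = 1000;   // illustrative block size
    Block prev, cur;
    bool  have_prev = false;

    void add(double x) {
        cur.add(x);
        if (cur.n == N) {                 // current block is full:
            prev      = cur;              // it becomes the "old" data
            cur       = Block{};
            have_prev = true;
        }
    }

    // assumes at least one value has been added
    double mean() const {
        if (!have_prev)  return cur.mean();            // still in the first block
        if (cur.n == 0)  return prev.mean();           // exactly at a block boundary
        const double w = static_cast<double>(cur.n) / N;   // weight grows 0 -> 1
        return w * cur.mean() + (1.0 - w) * prev.mean();
    }
};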

ps, something that occurred to me later - really, doing this "forever" doesn't make much sense anyway, because you'll get to a point where the values are absolutely dominated by the old data. it would be better to use a "running mean" which gives decreased weight to old values. see for example https://en.wikipedia.org/wiki/Moving_average - however, i don't know of a common equivalent to stdev.
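
a sketch of an exponentially weighted running mean, plus one formulation that is sometimes used to track a matching spread (alpha and the names are illustrative; i'm not claiming this is a standard stdev equivalent):

#include <cmath>

struct ExpStats {
    double alpha = 0.01;   // weight of each new value (illustrative; smaller = longer memory)
    double mean  = 0.0;
    double var   = 0.0;
    bool   first = true;

    void add(double x) {
        if (first) { mean = x; var = 0.0; first = false; return; }
        const double delta = x - mean;
        mean += alpha * delta;
        var   = (1.0 - alpha) * (var + alpha * delta * delta);
    }

    double stdev() const { return std::sqrt(var); }
};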

andrew cooke, answered Mar 11 '23


Interesting question.

Let's discuss the mean first, if only because it's a little simpler.

You are right about the rounding error on a running total. It'll kill your accuracy for a large enough data set. You'd like to presort the data, totaling the small data first; but of course this is impossible in your case. However, you can achieve most of the benefit of sorted data by keeping several running totals.

A conceptual example, C- or C++-style:

const double max_small  =    0.001;
const double max_medium = 1000.0;

double total_small  = 0.0;   // initialize the running totals
double total_medium = 0.0;
double total_large  = 0.0;

while (true) {
    const double datum = get_datum(); // (Use here whatever function you use to get a datum.)
    if (!is_datum_valid()) break;
    if (fabs(datum) <= max_small) total_small += datum;        // fabs() from <math.h>/<cmath>
    else if (fabs(datum) <= max_medium) total_medium += datum;
    else total_large += datum;
}

double total = 0.0;
total += total_small;
total += total_medium;
total += total_large;

In realistic code, you'll probably keep more than three running totals -- and of course you'll also keep running totals of the squares of the data -- but the above example conveys the idea. You can fill in details.

Also, adapting @andrewcooke's idea, you can expand the loop in something like this manner:

while (true) {
    const double datum = get_datum();
    if (!is_datum_valid()) break;
    if (fabs(datum) <= max_small) {
        total_small += datum;
        if (fabs(total_small) > max_medium) {   // flush the small total before it grows too large
            total_large += total_small;
            total_small = 0.0;
        }
    }
    else if (fabs(datum) <= max_medium) total_medium += datum;
    else total_large += datum;
}

Again, you can fill in details. Good luck.

APPENDIX: THE PRACTICAL CALCULATION OF THE STANDARD DEVIATION

A good question has been raised in the various comment threads here regarding how a standard deviation is to be calculated without prior knowledge of the mean. Fortunately, a trick is known for calculating the standard deviation in just this way. I have prepared two pages of notes that explain the trick here (in PDF).

At any rate, all that is necessary to include the standard deviation among the running statistics is to total not just the data but also the squares of the data; and, of course, the squares can be totaled in like manner as the data, themselves, following the same pattern as in the code above.
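
Here is a sketch of that final step, assuming the grand totals of the data and of their squares have already been accumulated as above (the function and variable names are merely illustrative):

#include <cmath>

double final_mean(double total, double count) {
    return total / count;
}

double final_stdev(double total, double total_sq, double count) {
    const double mean = total / count;
    return std::sqrt(total_sq / count - mean * mean);   // population standard deviation
}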

thb, answered Mar 11 '23