Mean value and standard deviation of a very large data set

I am wondering if there is an algorithm that calculates the mean value and standard deviation of an unbounded data set.

For example, I am monitoring a measured value, say, electric current. I would like to have the mean of all historical data. Whenever a new value comes in, the mean and stdev should be updated. Because the data set is too big to store, I hope it can update the mean and stdev on the fly without storing the data.

Even if the data were stored, the standard way, (d1 + ... + dn)/n, doesn't work: the sum would overflow the data representation.

I thought about summing d1/n + d2/n + ... + dn/n, but if n is huge the error is too big and accumulates. Besides, n is unbounded in this case.

The number of data points is definitely unbounded; whenever a new one arrives, the values need to be updated.

Does anybody know if there is an algorithm for it?

Alfred Zhong, asked Apr 28 '12



2 Answers

[did the question change? maybe i only read the start. i have updated/edited to give a better reply:]

there is no perfect solution (in constant memory) that i know of, but i can give various approaches.

first, for the basic calculation you only need the sum of all values (sum_x), the sum of their squares (sum_x2), and the total count (n). then:

mean  = sum_x / n
stdev = sqrt( sum_x2/n - mean^2 )

and all these values (sum_x, sum_x2, n) can be updated from a stream.
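
a minimal C++ sketch of those three running values (the struct and member names are just illustrative, and this form has exactly the precision problems discussed below):

#include <cmath>
#include <cstdint>

// minimal sketch: keep only sum_x, sum_x2 and n, and update them per value.
struct StreamingStats {
    double   sum_x  = 0.0;   // sum of all values seen so far
    double   sum_x2 = 0.0;   // sum of their squares
    uint64_t n      = 0;     // count of values

    void add(double x) {
        sum_x  += x;
        sum_x2 += x * x;
        ++n;
    }

    double mean() const { return sum_x / n; }

    double stdev() const {
        const double m = mean();
        return std::sqrt(sum_x2 / n - m * m);   // population stdev
    }
};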

the problem (as you say) is dealing with overflow and / or limited precision. you can see this with floating point when sum_x2 becomes so large that its internal representation can no longer resolve values of the magnitude of a single squared value.

a simple way to avoid the problem is to use exact arithmetic, but that will be increasingly slow (and also uses O(log(n)) memory).

another way that can help is to normalise values - if you know that values are typically X then you can do calculations on x - X which makes the sums smaller (obviously you then add X to the mean!). that helps postpone the point at which precision is lost (and can/should be combined with any other method here - when binning, for example, you can use the mean of the previous bin). see this algorithm (knuth's method) for how to do this progressively.
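
a rough sketch of that progressive update (knuth's method, often written in welford's form; the names here are illustrative): track the running mean directly plus a sum of squared deviations from it, rather than raw squares.

#include <cmath>
#include <cstdint>

struct WelfordStats {
    uint64_t n    = 0;
    double   mean = 0.0;
    double   m2   = 0.0;   // sum of squared differences from the current mean

    void add(double x) {
        ++n;
        const double delta = x - mean;
        mean += delta / n;
        m2   += delta * (x - mean);   // second factor uses the updated mean
    }

    double variance() const { return m2 / n; }               // population variance
    double stdev()    const { return std::sqrt(variance()); }
};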

if you don't mind a (small constant factor) O(n) memory cost you could restart every N values (eg million - smarter still would be to adapt this value by detecting when precision is too low), storing previous mean and stdev and then combine for the final result (so your mean is the appropriately weighted value from the recent running total and the old binned values).
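
a sketch of that combining step (names are illustrative; storing the m2 term from the previous sketch instead of the stdev makes the merge exact):

#include <cstdint>

struct Summary {
    uint64_t n    = 0;
    double   mean = 0.0;
    double   m2   = 0.0;   // sum of squared deviations, as above
};

// merge two stored summaries into one
Summary combine(const Summary& a, const Summary& b) {
    Summary out;
    out.n = a.n + b.n;
    if (out.n == 0) return out;
    const double delta = b.mean - a.mean;
    out.mean = a.mean + delta * static_cast<double>(b.n) / out.n;
    out.m2   = a.m2 + b.m2
             + delta * delta * static_cast<double>(a.n) * static_cast<double>(b.n) / out.n;
    return out;
}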

the binning approach could probably be generalised to multiple levels (you start binning the bins) and would reduce to O(log(n)) memory use, but i haven't worked out the details.

finally, a more practical solution would probably be to do the initial approach for, say, 1000 values, and then start a new sum in parallel. you could display a weighted average of the two and, after another 1000 values, drop the old sums (after gradually decreasing their weight) and start a new set. so you always have two sets of sums, and display a weighted average between them, which gives continuous data that reflect the last 1000 values (only). in some cases that will be good enough, i imagine (it's not an exact value, since it's only for "recent" data, but it's smooth and representative, and uses a fixed amount of memory).
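
a rough sketch of that two-block cross-fade, for the mean only (block size, names and the exact weighting are illustrative choices; the same pattern applies to the sums of squares):

#include <cstdint>

struct Block {
    double   sum = 0.0;
    uint64_t n   = 0;
    void   add(double x) { sum += x; ++n; }
    double mean() const  { return sum / n; }
};

struct RollingMean {
    static constexpr uint64_t N = 1000;   // illustrative block size
    Block prev, cur;
    bool  have_prev = false;

    void add(double x) {
        cur.add(x);
        if (cur.n == N) {                 // current block is full:
            prev      = cur;              // it becomes the "old" data
            cur       = Block{};
            have_prev = true;
        }
    }

    // assumes at least one value has been added
    double mean() const {
        if (!have_prev)  return cur.mean();            // still in the first block
        if (cur.n == 0)  return prev.mean();           // exactly at a block boundary
        const double w = static_cast<double>(cur.n) / N;   // weight grows 0 -> 1
        return w * cur.mean() + (1.0 - w) * prev.mean();
    }
};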

ps, something that occurred to me later - really, doing this "forever" doesn't make much sense anyway, because you'll get to a point where the values are absolutely dominated by the old data. it would be better to use a "running mean" which gives decreased weight to old values. see for example https://en.wikipedia.org/wiki/Moving_average - however, i don't know of a common equivalent to stdev.
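
a sketch of an exponentially weighted running mean, plus one formulation that is sometimes used to track a matching spread (alpha and the names are illustrative; i'm not claiming this is a standard stdev equivalent):

#include <cmath>

struct ExpStats {
    double alpha = 0.01;   // weight of each new value (illustrative; smaller = longer memory)
    double mean  = 0.0;
    double var   = 0.0;
    bool   first = true;

    void add(double x) {
        if (first) { mean = x; var = 0.0; first = false; return; }
        const double delta = x - mean;
        mean += alpha * delta;
        var   = (1.0 - alpha) * (var + alpha * delta * delta);
    }

    double stdev() const { return std::sqrt(var); }
};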

andrew cooke, answered Mar 11 '23


Interesting question.

Let's discuss the mean first, if only because it's a little simpler.

You are right about the rounding error on a running total. It'll kill your accuracy for a large enough data set. You'd like to presort the data, totaling the small data first; but of course this is impossible in your case. However, you can achieve most of the benefit of sorted data by keeping several running totals.

A conceptual example, C- or C++-style:

const double max_small  =    0.001;
const double max_medium = 1000.0;

double total_small  = 0.0;   // initialize the running totals
double total_medium = 0.0;
double total_large  = 0.0;

while (true) {
    const double datum = get_datum(); // (Use here whatever function you use to get a datum.)
    if (!is_datum_valid()) break;
    if (fabs(datum) <= max_small) total_small += datum;        // fabs() from <math.h>/<cmath>
    else if (fabs(datum) <= max_medium) total_medium += datum;
    else total_large += datum;
}

double total = 0.0;
total += total_small;
total += total_medium;
total += total_large;

In realistic code, you'll probably keep more than three running totals -- and of course you'll also keep running totals of the squares of the data -- but the above example conveys the idea. You can fill in details.

Also, adapting @andrewcooke's idea, you can expand the loop in something like this manner:

while (true) {
    const double datum = get_datum();
    if (!is_datum_valid()) break;
    if (fabs(datum) <= max_small) {
        total_small += datum;
        if (fabs(total_small) > max_medium) {   // flush the small total before it grows too large
            total_large += total_small;
            total_small = 0.0;
        }
    }
    else if (fabs(datum) <= max_medium) total_medium += datum;
    else total_large += datum;
}

Again, you can fill in details. Good luck.

APPENDIX: THE PRACTICAL CALCULATION OF THE STANDARD DEVIATION

A good question has been raised in the various comment threads here regarding how a standard deviation is to be calculated without prior knowledge of the mean. Fortunately, a trick is known for calculating the standard deviation in just this way. I have prepared two pages of notes that explain the trick here (in PDF).

At any rate, all that is necessary to include the standard deviation among the running statistics is to total not just the data but also the squares of the data; and, of course, the squares can be totaled in like manner as the data, themselves, following the same pattern as in the code above.
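
Here is a sketch of that final step, assuming the grand totals of the data and of their squares have already been accumulated as above (the function and variable names are merely illustrative):

#include <cmath>

double final_mean(double total, double count) {
    return total / count;
}

double final_stdev(double total, double total_sq, double count) {
    const double mean = total / count;
    return std::sqrt(total_sq / count - mean * mean);   // population standard deviation
}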

thb, answered Mar 11 '23