I have a lot of data stored on disk in large arrays. I can't load everything into memory at once.
How could one calculate the mean and the standard deviation?
There is a simple online algorithm that computes both the mean and the variance by looking at each data point exactly once and using O(1) memory.
Wikipedia offers the following code:
def online_variance(data):
    n = 0
    mean = 0.0
    M2 = 0.0  # running sum of squared deviations from the current mean
    for x in data:
        n = n + 1
        delta = x - mean
        mean = mean + delta / n
        M2 = M2 + delta * (x - mean)
    variance = M2 / (n - 1)  # sample variance; requires n >= 2
    return variance
This algorithm is also known as Welford's method. Unlike the sum-of-squares method suggested in the other answer, it can be shown to have good numerical stability.
Take the square root of the variance to get the standard deviation.
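Since the question also asks for the mean, here is a self-contained sketch of the same Welford update that returns both the mean and the variance (the function name and the sample data are my own, not from the answer):

```python
import math

def welford(stream):
    # Single pass, O(1) memory: only n, mean, and M2 are kept.
    n, mean, M2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        M2 += delta * (x - mean)
    variance = M2 / (n - 1)  # sample variance; requires n >= 2
    return mean, variance

# Works with any iterable, e.g. a generator that reads from disk chunk by chunk.
mean, var = welford(iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
std = math.sqrt(var)  # mean = 5.0, var = 32/7
```

Because it accepts any iterator, you can feed it a generator that yields values from your on-disk arrays one chunk at a time without ever materializing the full dataset.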
Sounds like a math question. For the mean, you know that you can take the mean of a chunk of data, and then take the mean of the means. If the chunks aren't the same size, you'll have to take a weighted average.
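The weighted average of chunk means could look like this (a minimal sketch; the function name is mine):

```python
def combined_mean(chunks):
    # chunks: iterable of (chunk_mean, chunk_size) pairs.
    # The overall mean is the size-weighted average of the chunk means.
    total, count = 0.0, 0
    for m, n in chunks:
        total += m * n
        count += n
    return total / count

# e.g. one chunk with mean 2.0 over 3 items, another with mean 5.0 over 1 item:
# (2.0*3 + 5.0*1) / 4 = 2.75
result = combined_mean([(2.0, 3), (5.0, 1)])
```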
For the standard deviation, you'll have to calculate the variance first. I'd suggest doing this alongside the calculation of the mean. For variance, you have
Var(X) = Avg(X^2) - Avg(X)^2
So compute the average of your data and the average of your (data^2), aggregate them as above, and then take the difference.
Then the standard deviation is just the square root of the variance.
Note that you could do the whole thing in a single pass with iterators, which avoids loading the data into memory at all.
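Putting this answer's steps together, a sketch might look like the following (names are my own; note this yields the population variance, dividing by n, and the subtraction can lose precision when the variance is small relative to the mean squared):

```python
import math

def mean_and_std(chunks):
    # chunks: iterable of sequences, e.g. arrays loaded from disk one at a time.
    # Only the running totals sum(x) and sum(x^2) are kept in memory.
    n = 0
    s = 0.0   # sum of x
    s2 = 0.0  # sum of x^2
    for chunk in chunks:
        for x in chunk:
            n += 1
            s += x
            s2 += x * x
    mean = s / n
    variance = s2 / n - mean * mean  # Var(X) = Avg(X^2) - Avg(X)^2
    return mean, math.sqrt(variance)

# Feed the chunks one at a time; here three small lists stand in for
# large arrays read from disk.
m, sd = mean_and_std([[2, 4, 4], [4, 5, 5], [7, 9]])  # m = 5.0, sd = 2.0
```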