calculate mean and variance with one iteration

I have an iterator of numbers, for example a file object:

f = open("datafile.dat")

Now I want to compute:

mean = get_mean(f)
sigma = get_sigma(f, mean)

What is the best implementation? Suppose that the file is big and I would like to avoid reading it twice.

Ruggero Turra asked Feb 26 '10 11:02


2 Answers

If you want to iterate only once, you can write your own sum function that accumulates the sum and the sum of squares together:

def mysum(l):
    s2 = 0
    s = 0
    for e in l:
        s += e
        s2 += e * e
    return (s, s2)

and use the result in your sigma function.

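Edit: now you can calculate the variance like this: (s2 - (s*s) / N) / N, where N is the number of elements (mysum would need to count them as well). In Python 2, make sure s or N is a float so the division is not truncated.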

Taking @Adam Bowen's comment into account:
keep in mind that algebraically rearranging the original formulas like this
can degrade numerical accuracy, for example through cancellation when s2 and (s*s) / N are large and nearly equal.
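
To make the edit above concrete, here is a minimal sketch that counts the elements in the same loop and returns the mean and the standard deviation together. The name mean_and_sigma is made up for illustration, and the float() conversion assumes the file holds one number per line, which the question does not actually state. It computes the population variance (dividing by N), not the sample variance.

```python
import math

def mean_and_sigma(values):
    """Single pass: accumulate the count, the sum, and the sum of squares."""
    n = 0
    s = 0.0
    s2 = 0.0
    for v in values:
        x = float(v)  # assumption: each item is a number or a numeric string
        n += 1
        s += x
        s2 += x * x
    if n == 0:
        raise ValueError("no data")
    mean = s / n
    variance = (s2 - s * s / n) / n  # population variance, as in the edit above
    # Rounding can push a tiny variance slightly below zero, so clamp it.
    return mean, math.sqrt(max(variance, 0.0))
```

With the file object from the question this would be called as mean, sigma = mean_and_sigma(f), reading the file only once. The accuracy caveat above still applies to this version.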

Nick Dandoulakis answered Oct 11 '22 23:10


I think Nick D has the correct answer.

Assuming you want to compute both mean and variance in one sweep of the file (and you don't really want two functions that have to be called one after the other), you can collect the sum of the values and the sum of their squares, and then use these sums (together with the number of elements read) to compute the mean and the variance at the same time.

There are some numerical stability issues, but the idea in

http://en.wikipedia.org/wiki/Computational_formula_for_the_variance

is the basic ingredient you need. Some more details are at

http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance

where I suggest you read the "Naïve algorithm" section.
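
If the numerical stability issues mentioned above matter for your data, the same Wikipedia page also describes Welford's online algorithm, which still needs only one pass. The sketch below is one way to write it, not the naïve algorithm itself. The function name welford is made up for illustration, and as before it assumes each item converts with float() and returns the population standard deviation.

```python
import math

def welford(values):
    """Single pass using Welford's running mean and sum of squared deviations."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for v in values:
        x = float(v)  # assumption: each item is a number or a numeric string
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # uses the already updated mean
    if n == 0:
        raise ValueError("no data")
    return mean, math.sqrt(m2 / n)  # population standard deviation
```

Because it never subtracts two large, nearly equal sums, this form tends to hold up better when the values are large compared with their spread.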

Hope this helps,

Massimo

Mapio answered Oct 12 '22 01:10