I have an iterator of numbers, for example a file object:
f = open("datafile.dat")
now I want to compute:
mean = get_mean(f)
sigma = get_sigma(f, mean)
What is the best implementation? Suppose that the file is big and I would like to avoid reading it twice.
If you want to iterate only once, you can write your own sum function:
def mysum(l):
    s2 = 0  # running sum of the squared values
    s = 0   # running sum of the values
    for e in l:
        s += e
        s2 += e * e
    return (s, s2)
and use the result in your sigma function.
Edit: now you can calculate the variance like this: (s2 - (s*s) / N) / N, where N is the number of elements.
Taking @Adam Bowen's comment into account, keep in mind that rearranging the original formulas with mathematical tricks like this can degrade the numerical accuracy of the results.
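As a rough sketch of how the pieces above could fit together, the function below reads the data once and returns both the mean and sigma (the population standard deviation). The name mean_and_sigma is only illustrative, and the usage comment assumes the file holds one number per line, which the question does not state.

```python
import math

def mean_and_sigma(values):
    """Return (mean, sigma) for an iterable of numbers, consuming it once."""
    n = 0      # number of elements seen
    s = 0.0    # running sum of the values
    s2 = 0.0   # running sum of the squared values
    for x in values:
        n += 1
        s += x
        s2 += x * x
    if n == 0:
        raise ValueError("need at least one value")
    mean = s / n
    variance = (s2 - s * s / n) / n  # same formula as in the edit above
    # Rounding can push the variance slightly below zero, so clamp it.
    return mean, math.sqrt(max(variance, 0.0))

# Possible usage, assuming one number per line in the file:
# with open("datafile.dat") as f:
#     mean, sigma = mean_and_sigma(float(line) for line in f)
```

Counting the elements inside the loop matters here because a file iterator does not know its length in advance.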
I think Nick D has the correct answer.
Assuming you want to compute both mean and variance in one sweep of the file (and you don't really want two functions that have to be called one after the other), you can collect the sum of the values and of their squares and then use these sums (together with the number of elements read) to compute the mean and variance at the same time.
There are some numerical stability issues, but the idea in
http://en.wikipedia.org/wiki/Computational_formula_for_the_variance
is the basic ingredient you need. Some more details are at
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
where I suggest you read the "Naïve algorithm" section.
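If the numerical stability issues turn out to matter for your data, one alternative described on the same Wikipedia page is Welford's online algorithm. Note that this is a different technique from the sum-of-squares approach in the answers here. The sketch below is only one possible way to write it; the name running_mean_sigma is made up for illustration, and it returns the population standard deviation.

```python
import math

def running_mean_sigma(values):
    """Single-pass mean and sigma using Welford's online update."""
    n = 0
    mean = 0.0
    m2 = 0.0  # sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean          # deviation from the old mean
        mean += delta / n         # update the running mean
        m2 += delta * (x - mean)  # uses both the old and the new mean
    if n == 0:
        raise ValueError("need at least one value")
    return mean, math.sqrt(m2 / n)
```

Because it never subtracts two large, nearly equal sums, this version tends to hold up better when the values are large compared with their spread.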
Hope this helps,
Massimo