I'm trying to do a simple variance calculation on a set of 3 numbers:
numpy.var([0.82159889, 0.26007962, 0.09818412])
which returns
0.09609366366174843
However, when I calculate the variance by hand, it should actually be
0.1441405
Seems like such a simple thing, but I haven't been able to find an answer yet.
As the documentation explains:
ddof : int, optional
"Delta Degrees of Freedom": the divisor used in the calculation is
``N - ddof``, where ``N`` represents the number of elements. By
default `ddof` is zero.
And so you have:
>>> numpy.var([0.82159889, 0.26007962, 0.09818412], ddof=0)
0.09609366366174843
>>> numpy.var([0.82159889, 0.26007962, 0.09818412], ddof=1)
0.14414049549262264
Both conventions are common enough that you always need to check which one is being used by whatever package you're using, in any language.
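For example, Python's standard-library statistics module splits the two conventions into separate functions, while NumPy selects between them with ddof. A quick sketch (results shown only approximately in the comments, since the exact floating-point digits can differ slightly between implementations):
>>> import statistics
>>> vals = [0.82159889, 0.26007962, 0.09818412]
>>> statistics.pvariance(vals)  # population variance, divisor N: ~0.0961
>>> statistics.variance(vals)   # sample variance, divisor N - 1: ~0.1441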
np.var by default calculates the population variance.
The Sum of Squared Errors can be calculated as follows:
>>> vals = [0.82159889, 0.26007962, 0.09818412]
>>> mean = sum(vals)/3.0
>>> mean
0.3932875433333333
>>> sse = sum((val - mean)**2 for val in vals)
>>> sse
0.2882809909852453
This is the population variance:
>>> sse/3
0.09609366366174843
>>> np.var(vals)
0.09609366366174843
This is the sample variance:
>>> sse/(3-1)
0.14414049549262264
>>> np.var(vals, ddof=1)
0.14414049549262264
You can read more about the difference under the topic of Bessel's correction (the N - 1 divisor).