If I have a list like this:
results=[-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
I want to calculate the variance of this list in Python, i.e. the average of the squared differences from the mean.
How can I go about this? I'm confused about how to access the elements of the list to compute the squared differences.
Coding a stdev() Function in Python: our stdev() function takes some data and returns the population standard deviation. To do that, we rely on our previous variance() function to calculate the variance, and then we use math.sqrt() to take the square root of the variance.
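The variance() function referred to above is not shown here, so the following is only a minimal sketch of what such a pair of functions might look like, assuming both work on a plain list of numbers and use the population (divide by n) definition:

```python
import math

def variance(data):
    """Population variance: the mean of the squared deviations from the mean."""
    n = len(data)
    if n == 0:
        raise ValueError("variance requires at least one data point")
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n

def stdev(data):
    """Population standard deviation: the square root of the population variance."""
    return math.sqrt(variance(data))

results = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
           0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
print(variance(results))  # roughly 28.8224
print(stdev(results))     # roughly 5.3686
```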
You can calculate the variance of a pandas DataFrame with the DataFrame.var() method, which computes the variance of each column (the sample variance, ddof=1, by default). You can then select the column you're interested in from the result.
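As a rough illustration with pandas, assuming the list is loaded into a single column (the column name "results" is just a placeholder chosen for this sketch). Note that pandas defaults to the sample variance (ddof=1), unlike numpy:

```python
import pandas as pd

results = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
           0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
df = pd.DataFrame({"results": results})  # hypothetical column name

# DataFrame.var() returns one value per column; the default is ddof=1 (variance n-1)
sample_var = df.var()["results"]

# Pass ddof=0 to get the population variance (variance n) instead
population_var = df.var(ddof=0)["results"]

print(sample_var)      # roughly 32.0248
print(population_var)  # roughly 28.8224
```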
In NumPy, the variance can be calculated for a vector or a matrix using the var() function. By default, the var() function calculates the population variance. To calculate the sample variance, you must set the ddof argument to the value 1.
You can use numpy's built-in function var:
import numpy as np
results = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
print(np.var(results))
This gives you 28.822364260579157
If, for whatever reason, you cannot use numpy and/or you don't want to use a built-in function for it, you can also calculate it "by hand" using e.g. a list comprehension:
# calculate mean
m = sum(results) / len(results)
# calculate variance using a list comprehension
var_res = sum((xi - m) ** 2 for xi in results) / len(results)
which gives you the identical result.
If you are interested in the standard deviation, you can use numpy.std:
print(np.std(results))
5.36864640860051
@Serge Ballesta explained very well the difference between variance n and n-1. In numpy you can easily set this parameter using the option ddof; its default is 0, so for the n-1 case you can simply do:
np.var(results, ddof=1)
The "by hand" solution is given in @Serge Ballesta's answer.
Both approaches yield 32.024849178421285.
You can also set the parameter for std:
np.std(results, ddof=1)
5.659050201086865
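If numpy is not available, the same switch between n and n-1 can be sketched in plain Python. The function names and the ddof parameter below merely mimic numpy's naming; they are an illustration, not numpy's implementation:

```python
import math

def variance(data, ddof=0):
    """Variance with divisor n - ddof (ddof=0: variance n, ddof=1: variance n-1)."""
    n = len(data)
    if n - ddof <= 0:
        raise ValueError("need more than ddof data points")
    m = sum(data) / n
    return sum((x - m) ** 2 for x in data) / (n - ddof)

def std(data, ddof=0):
    """Standard deviation: square root of the variance with the same divisor."""
    return math.sqrt(variance(data, ddof))

results = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
           0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
print(variance(results), variance(results, ddof=1))
print(std(results), std(results, ddof=1))
```

With ten values the two variances differ by a factor of 10/9, which matches the figures quoted above.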
Starting with Python 3.4, the standard library comes with the variance function (sample variance or variance n-1) as part of the statistics module:
from statistics import variance
data = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439, 0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
variance(data)
# 32.024849178421285
The population variance (or variance n) can be obtained using the pvariance function:
from statistics import pvariance
data = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439, 0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
pvariance(data)
# 28.822364260579157
Also note that if you already know the mean of your list, the variance and pvariance functions take a second argument (xbar and mu, respectively) so that the mean of the sample, which is part of the variance computation, does not have to be recomputed.
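For instance, a short sketch of passing a precomputed mean (the variable names are arbitrary). The value passed must really be the mean of the data, otherwise the result is not a valid variance:

```python
from statistics import mean, pvariance, variance

data = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
        0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]

m = mean(data)  # computed once, reused below

sample_var = variance(data, xbar=m)     # variance n-1
population_var = pvariance(data, mu=m)  # variance n

print(sample_var, population_var)
```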
Well, there are two ways of defining the variance. You have the variance n, which you use when you have the full set, and the variance n-1, which you use when you have a sample.
The difference between the two is whether the value m = sum(xi) / n is the real average or just an approximation of what the average should be.
Example 1: you want to know the average height of the students in a class and its variance. Here the value m = sum(xi) / n is the real average, and the formulas given by Cleb are fine (variance n).
Example 2: you want to know the average time at which a bus passes the bus stop and its variance. You note the time for a month and get 30 values. Here the value m = sum(xi) / n is only an approximation of the real average, and that approximation becomes more accurate with more values. In that case the best approximation of the actual variance is the variance n-1:
varRes = sum([(xi - m)**2 for xi in results]) / (len(results) -1)
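To see why the divisor n-1 suits a sample, here is a small self-contained sketch on a toy population (the faces of a fair die, chosen purely for illustration). It enumerates every equally likely sample of size 2 and averages both estimators, using exact fractions so no rounding is involved:

```python
from fractions import Fraction
from itertools import product

population = [Fraction(x) for x in range(1, 7)]  # toy population: faces of a die

def var(data, ddof):
    """Variance with divisor n - ddof, computed around the data's own mean."""
    n = len(data)
    m = sum(data) / n
    return sum((x - m) ** 2 for x in data) / (n - ddof)

true_var = var(population, ddof=0)  # variance n of the full set

# All 36 ordered samples of size 2, drawn with replacement
samples = list(product(population, repeat=2))
avg_n = sum(var(s, 0) for s in samples) / len(samples)
avg_n_minus_1 = sum(var(s, 1) for s in samples) / len(samples)

print(true_var, avg_n, avg_n_minus_1)  # 35/12 35/24 35/12
```

On average the variance n-1 of a sample recovers the variance of the full set, while the variance n of a sample underestimates it, because the sample mean sits closer to the sampled values than the real average does.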
OK, this has nothing to do with Python as such, but it does have an impact on statistical analysis, and the question is tagged statistics and variance.
Note: libraries differ in their defaults. numpy uses the variance n by default (ddof=0) for both var and std, whereas the standard library's statistics.variance and statistics.stdev use the variance n-1.