
How can I calculate the variance of a list in Python?

If I have a list like this:

results=[-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
          0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]

I want to calculate the variance of this list in Python, that is, the average of the squared differences from the mean.

How can I go about this? I am confused about how to access the elements of the list to compute the squared differences.

minks asked Feb 23 '16 16:02



3 Answers

You can use numpy's built-in function var:

import numpy as np

results = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
          0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]

print(np.var(results))

This gives you 28.822364260579157

If, for whatever reason, you cannot use numpy or you don't want to use a built-in function, you can also calculate it "by hand", e.g. with a generator expression:

# calculate mean
m = sum(results) / len(results)

# calculate variance (divisor n) using a generator expression
var_res = sum((xi - m) ** 2 for xi in results) / len(results)

which gives you the identical result.
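For reuse, the two steps above can be wrapped in a small function. This is only a minimal sketch: the name variance_by_hand is made up for illustration, and its ddof parameter is an assumption borrowed from numpy's option of the same name (the divisor is n - ddof).

```python
import math

def variance_by_hand(values, ddof=0):
    """Variance of a sequence; ddof=0 divides by n, ddof=1 by n - 1."""
    n = len(values)
    if n - ddof <= 0:
        raise ValueError("need more than ddof data points")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - ddof)

results = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
           0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]

pop_var = variance_by_hand(results)   # divisor n, roughly 28.82
pop_std = math.sqrt(pop_var)          # standard deviation, roughly 5.37
print(pop_var, pop_std)
```

The standard deviation is simply the square root of the variance, so no separate loop is needed for it.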

If you are interested in the standard deviation, you can use numpy.std:

print(np.std(results))
5.36864640860051

@Serge Ballesta explained very well the difference between variance n and n-1. In numpy you can choose between the two with the option ddof: the divisor is n - ddof, and the default is ddof=0, so for the n-1 case you can simply do:

np.var(results, ddof=1)

The "by hand" solution is given in @Serge Ballesta's answer.

Both approaches yield 32.024849178421285.

You can set the parameter also for std:

np.std(results, ddof=1)
5.659050201086865
Cleb answered Oct 16 '22 18:10


Starting with Python 3.4, the standard library includes a variance function (sample variance, i.e. variance n-1) in the statistics module:

from statistics import variance
data = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439, 0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
variance(data)
# 32.024849178421285

The population variance (or variance n) can be obtained using the pvariance function:

from statistics import pvariance
data = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439, 0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
pvariance(data)
# 28.822364260579157

Also note that if you already know the mean of your list, the variance and pvariance functions take an optional second argument (xbar and mu, respectively) so that the mean, which is part of the variance computation, does not have to be recomputed.
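As a short sketch of that second argument, the mean can be computed once and passed to both functions. The variable names here are illustrative; the keyword names xbar and mu are the ones used by the statistics module.

```python
from statistics import mean, variance, pvariance

data = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
        0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]

m = mean(data)  # computed once, reused below

sample_var = variance(data, xbar=m)       # variance n-1
population_var = pvariance(data, mu=m)    # variance n
print(sample_var, population_var)
```

The functions do not check that the value passed in really is the mean of the data, so a wrong value silently produces a wrong variance.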

Xavier Guihot answered Oct 16 '22 18:10


Well, there are two ways of defining the variance: the variance n, which you use when you have the full population, and the variance n-1, which you use when you only have a sample.

The difference between the two is whether the value m = sum(xi) / n is the real average or just an estimate of what the average should be.

Example 1: you want to know the average height of the students in a class and its variance. Here the value m = sum(xi) / n is the real average, and the formulas given by Cleb are fine (variance n).

Example 2: you want to know the average time at which a bus passes the bus stop and its variance. You note the time every day for a month and get 30 values. Here the value m = sum(xi) / n is only an estimate of the real average, and that estimate becomes more accurate with more values. In that case the better estimate of the actual variance is the variance n-1:

m = sum(results) / len(results)
varRes = sum((xi - m) ** 2 for xi in results) / (len(results) - 1)
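Both definitions share the same numerator and differ only in the divisor, so the variance n-1 is always the variance n multiplied by n / (n - 1). A small sketch on the list from the question (the variable names are illustrative):

```python
results = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
           0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]

n = len(results)
m = sum(results) / n
squared_diffs = sum((xi - m) ** 2 for xi in results)  # shared numerator

var_n = squared_diffs / n                 # full population
var_n_minus_1 = squared_diffs / (n - 1)   # sample

# The ratio of the two is n / (n - 1), here 10 / 9
print(var_n, var_n_minus_1, var_n_minus_1 / var_n)
```

With only 10 values the two differ by about 11%; the gap shrinks as the number of values grows.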

OK, this has nothing to do with Python itself, but it does have an impact on statistical analysis, and the question is tagged statistics and variance.

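Note: libraries differ in their defaults, so check which divisor you are getting. numpy uses the variance n by default for both var and std (ddof=0); pass ddof=1 to get the n-1 version. The statistics module in the standard library uses n-1 for variance and stdev, and n for pvariance and pstdev.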
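When it is unclear which divisor a standard deviation function uses, comparing it with the square root of a known variance settles the question. A minimal sketch using only the standard library's statistics module, which has a separate function for each convention (the variable names are illustrative):

```python
import math
from statistics import pstdev, stdev, pvariance, variance

data = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
        0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]

std_n = pstdev(data)          # based on the variance n
std_n_minus_1 = stdev(data)   # based on the variance n-1

# Each standard deviation is the square root of the matching variance
print(math.isclose(std_n, math.sqrt(pvariance(data))))
print(math.isclose(std_n_minus_1, math.sqrt(variance(data))))
```

The same comparison can be applied to any other library's function by substituting its result for std_n or std_n_minus_1.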

Serge Ballesta answered Oct 16 '22 19:10