Why is the var() function giving me a different answer than my calculated variance?

Tags: variance

I wasn't sure if this should go in SO or some other .SE, so I will delete if this is deemed to be off-topic.

I have a vector and I'm trying to calculate the variance "by hand" (meaning based on the definition of variance, but still performing the calculations in R) using the equation V[X] = E[X^2] - E[X]^2, where E[X] = sum(x * f(x)) and E[X^2] = sum(x^2 * f(x)).

However, my calculated variance is different from the var() function that R has (which I was using to check my work). Why is the var() function different? How is it calculating variance? I've checked my calculations several times so I'm fairly confident in the value I calculated. My code is provided below.

vec <- c(3, 5, 4, 3, 6, 7, 3, 6, 4, 6, 3, 4, 1, 3, 4, 4)
range(vec)
counts <- hist(vec + .01, breaks = 7)$counts
fx <- counts / (sum(counts)) #the pmf f(x)
x <- c(min(vec): max(vec)) #the values of x
exp <- sum(x * fx) ; exp #expected value of x
exp.square <- sum(x^2 * fx) #expected value of x^2
var <- exp.square - (exp)^2 ; var #calculated variance
var(vec)

This gives me a calculated variance of 2.234 but the var() function says the variance is 2.383.

asked Feb 20 '15 20:02

pocketlizard

2 Answers

V[X] = E[X^2] - E[X]^2 is the population variance, which applies when the values in the vector are the whole population and not just a sample. The var function instead calculates an estimator of the population variance (the sample variance), which divides the sum of squared deviations from the mean by n - 1 rather than by n.
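As a quick check on the vector from the question, the two denominators can be compared directly. This is only a sketch; the names ss, pop_formula and sample_formula are illustrative and not part of the original code.

```r
vec <- c(3, 5, 4, 3, 6, 7, 3, 6, 4, 6, 3, 4, 1, 3, 4, 4)
n <- length(vec)

# Sum of squared deviations from the sample mean.
ss <- sum((vec - mean(vec))^2)

pop_formula <- ss / n           # same value as E[X^2] - E[X]^2 with f(x) = relative frequency
sample_formula <- ss / (n - 1)  # the n - 1 denominator used by var()

pop_formula      # 2.234375
sample_formula   # 2.383333
var(vec)         # 2.383333
```

The two results differ only by the factor n/(n-1), here 16/15.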

answered Oct 12 '22 12:10

Sven Hohenstein

While this has been answered already, I fear some may still confuse the population variance with its estimate from a sample, and this may be due to the example.

If the vector vec represents the full population, then vec is simply a way to represent the distribution, which can be summarized more succinctly by the pmf that you derived from it. Crucially, the elements of vec in this case are not random variables. In this case, your computations of E[X] and V[X] from the pmf are correct.

Most of the time, however, when you have data (for instance in the form of a vector), it is a random sample from the underlying population. Each element of the vector is the observed value of a random variable: it is a "draw" from the population. For this example, it is fair to assume that each element is drawn independently from the same distribution ("iid"). In practice, this random sampling means that you cannot compute the true pmf, because the observed frequencies vary by chance from one sample to the next. Likewise, you can't get the true values of E[X], E[X^2], and thus V[X], from the sample. These values need to be estimated. The sample average is usually a good estimate for E[X] (in particular, it is unbiased), but it turns out that the sample variance computed with the formula above (which divides by the sample size n) is a biased estimate of the population variance: on average it is too small. To correct for this bias, you need to multiply it by the factor n/(n-1).

As this latter case is the most common in practice (aside from textbook exercises), it is what is computed when you call the var() function in R. So if you're asked to find the "estimated variance", it most likely implies that your vector vec is a random sample and that you fall in this latter case. If this was the original question, then you have your answer. I hope it is also clear that the choice of variable names and comments in your code can lead to confusion: you cannot compute the pmf, the expected value or the variance of the population from a random sample. What you can get are their estimates, and one of them (that of the variance) is biased.
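To make the bias concrete, here is a minimal sketch that enumerates every possible iid sample from a tiny population and averages both estimators exactly. The population pop and the sample size n are made up for illustration and have nothing to do with the data in the question.

```r
# Hypothetical small population in which every value is equally likely.
pop <- c(1, 3, 4, 7)
n <- 3  # sample size

# True population variance: E[X^2] - E[X]^2.
pop_var <- mean(pop^2) - mean(pop)^2

# All possible iid samples of size n (drawn with replacement), one per row.
samples <- as.matrix(expand.grid(rep(list(pop), n)))

# Divide-by-n estimate and var() (divide by n - 1) for every sample.
plug_in <- apply(samples, 1, function(s) mean(s^2) - mean(s)^2)
corrected <- apply(samples, 1, var)

# Every row is equally likely, so a plain mean is the exact expectation.
mean_plug_in <- mean(plug_in)
mean_corrected <- mean(corrected)
print(c(pop_var, mean_plug_in, mean_corrected))
```

The divide-by-n estimate averages to (n-1)/n times the population variance, while var() averages to the population variance itself, which is where the n/(n-1) correction comes from.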

I wanted to clarify this because the confusion visible in the code is very common when first getting acquainted with these concepts. In particular, the accepted answer may be misleading: V[X] = E[X^2] - E[X]^2 is not the sample variance; it is indeed the population variance, which you cannot get from the random sample. If you replace the values in this equation by their sample estimates (averages), you will get sample.V[X] = average[X^2] - average[X]^2, which is the uncorrected sample variance (it divides by n), and it is biased.
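A small sketch of that substitution on the vector from the question: the estimate built from the empirical pmf and the one built from sample averages are the same number. The names freq, fx_hat, est_from_pmf and est_from_avg are illustrative, and table() stands in for the hist() call in the question as one way to count the observed values.

```r
vec <- c(3, 5, 4, 3, 6, 7, 3, 6, 4, 6, 3, 4, 1, 3, 4, 4)

# Empirical pmf: relative frequency of each distinct observed value.
freq <- table(vec)
x <- as.numeric(names(freq))
fx_hat <- as.numeric(freq) / length(vec)

# Estimate obtained by plugging the empirical pmf into E[X^2] - E[X]^2 ...
est_from_pmf <- sum(x^2 * fx_hat) - sum(x * fx_hat)^2
# ... which is the same formula written with sample averages.
est_from_avg <- mean(vec^2) - mean(vec)^2

c(est_from_pmf, est_from_avg)
```

Both are estimates of the population quantity, not the quantity itself, so naming them as estimates in the code helps keep the distinction visible.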

Some may say that I am being picky about semantics; however, the "abuse of notation" in the accepted answer is only acceptable when everybody recognizes it as such. For those trying to figure out these conceptual differences, I believe it is best to remain precise.

answered Oct 12 '22 11:10

wiwh
