I wasn't sure if this should go in SO or some other .SE, so I will delete if this is deemed to be off-topic.
I have a vector and I'm trying to calculate the variance "by hand" (meaning based on the definition of variance but still performing the calculations in R) using the equation: V[X] = E[X^2] - E[X]^2
where E[X] = sum (x * f(x))
and E[X^2] = sum (x^2 * f(x))
However, my calculated variance is different from the result of the var() function that R has (which I was using to check my work). Why is the var() function different? How is it calculating variance? I've checked my calculations several times so I'm fairly confident in the value I calculated. My code is provided below.
vec <- c(3, 5, 4, 3, 6, 7, 3, 6, 4, 6, 3, 4, 1, 3, 4, 4)
range(vec)
counts <- hist(vec + .01, breaks = 7)$counts
fx <- counts / (sum(counts)) #the pmf f(x)
x <- c(min(vec): max(vec)) #the values of x
exp <- sum(x * fx) ; exp #expected value of x
exp.square <- sum(x^2 * fx) #expected value of x^2
var <- exp.square - (exp)^2 ; var #calculated variance
var(vec)
This gives me a calculated variance of 2.234 but the var() function says the variance is 2.383.
The var() function in R computes the sample variance of a vector, a measure of how far the values spread around their mean. It divides the sum of squared deviations by n-1, where n is the number of observations.
Since var() returns the sample variance, multiplying it by (n-1)/n recovers the population variance: var(data)*(n-1)/n
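As a quick check, the rescaling can be sketched on the vector from the question. The names n, pop_var and by_hand are chosen here for illustration only.

```r
vec <- c(3, 5, 4, 3, 6, 7, 3, 6, 4, 6, 3, 4, 1, 3, 4, 4)
n <- length(vec)                        # 16 observations

# var() divides by n - 1; rescaling by (n - 1) / n gives the divisor-n version
pop_var <- var(vec) * (n - 1) / n

# Same quantity from the identity V[X] = E[X^2] - E[X]^2, using plain averages
by_hand <- mean(vec^2) - mean(vec)^2

print(c(sample = var(vec), population = pop_var, by_hand = by_hand))
```

The last two values agree (2.234375), which is the number computed by hand in the question, while var(vec) itself is larger by the factor n/(n-1).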
While V[X] = E[X^2] - E[X]^2 is the population variance (when the values in the vector are the whole population, not just a sample), the var function calculates an estimator for the population variance (the sample variance).
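To see the two divisors side by side, here is a minimal sketch that computes both from the squared deviations. The variable names are illustrative and not part of the original code.

```r
vec <- c(3, 5, 4, 3, 6, 7, 3, 6, 4, 6, 3, 4, 1, 3, 4, 4)
n <- length(vec)

dev_sq <- (vec - mean(vec))^2           # squared deviations from the mean

var_n  <- sum(dev_sq) / n               # divisor n: the value computed by hand
var_n1 <- sum(dev_sq) / (n - 1)         # divisor n - 1: matches var(vec)

print(c(divisor_n = var_n, divisor_n_minus_1 = var_n1, var = var(vec)))
```

The only difference between the two results is the divisor, which accounts for 2.234 versus 2.383.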
While this has been answered already, I fear some may still be confused between population variance and its estimate from a sample, and this may be due to the example.
If the vector vec represents the full population, then vec is simply a way to represent the distribution function, which can be summarized more succinctly in the pmf that you derived from it. Crucially, the elements of vec in this case are not random variables. In this case, your computations of E[X] and var[X] from the pmf are correct.
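Treating vec as the whole population, the pmf-based computation can also be sketched with table() instead of hist(). This is one possible way to build the pmf, not the only one, and the names ex, ex2 and v are chosen here to avoid masking R's exp() and var().

```r
vec <- c(3, 5, 4, 3, 6, 7, 3, 6, 4, 6, 3, 4, 1, 3, 4, 4)

pmf <- table(vec) / length(vec)         # relative frequency of each observed value
x   <- as.numeric(names(pmf))           # the distinct values that occur in vec
fx  <- as.numeric(pmf)                  # f(x) for those values

ex  <- sum(x * fx)                      # E[X]
ex2 <- sum(x^2 * fx)                    # E[X^2]
v   <- ex2 - ex^2                       # V[X] = E[X^2] - E[X]^2

print(c(EX = ex, EX2 = ex2, VX = v))
```

Values that never occur (here, 2) are simply absent from the table; they would contribute zero to both sums anyway.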
Most of the time, however, when you have data (for instance in the form of a vector) it is a random sample from the underlying population. Each element of the vector is the observed value of a random variable: it is a "draw" from the population. For this example, it is fair to assume that each element is drawn independently, from the same distribution ("iid"). In practice, this random sampling means that you cannot compute the true pmf, as you may have some variations due merely to chance. Likewise, you can't get the true value of E[X], E[X^2], and thus Var[X], from the sample. These values need to be estimated. The sample average is usually a good estimate for E[X] (in particular, it is unbiased), but it turns out that the sample variance computed with divisor n, that is average[X^2] - average[X]^2, is a biased estimate for the population variance: on average it is too small by the factor (n-1)/n. To correct for this bias, you need to multiply it by the factor n/(n-1).
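The bias can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the original post: the population is a hypothetical fair six-sided die, and the sample size, number of repetitions and seed are arbitrary.

```r
set.seed(1)                             # arbitrary seed, for reproducibility
population <- 1:6                       # hypothetical population: a fair die
sigma2 <- mean(population^2) - mean(population)^2   # true population variance

n    <- 5                               # size of each random sample
reps <- 20000                           # number of simulated samples

# Divisor-n estimate, average[X^2] - average[X]^2, for each simulated sample
plug_in <- replicate(reps, {
  s <- sample(population, n, replace = TRUE)
  mean(s^2) - mean(s)^2
})

print(c(true          = sigma2,
        mean_plug_in  = mean(plug_in),                # near sigma2 * (n - 1) / n
        mean_rescaled = mean(plug_in) * n / (n - 1))) # near sigma2
```

Averaged over many samples, the divisor-n estimate settles below the true variance, and rescaling by n/(n-1) brings it back in line.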
As this latter case is the one most often seen in practice (aside from textbook exercises), the bias-corrected estimate is what is computed when you call the var() function in R. So if you're asked to find the "estimated variance", it most likely implies that your vector vec is a random sample and that you fall in this latter case. If this was the original question, then you have your answer, and I hope it becomes clear that the choice of the name of variables and the commenting in your code can lead to confusion. Indeed, you cannot compute the pmf, the expected value or the variance of the population from a random sample: what you can get are their estimates, and one of them (that of the variance) is biased.
I wanted to clarify this, as this confusion, as seen in the coding, is very common when first being acquainted with these concepts. In particular, the accepted answer may be misleading: V[X] = E[X^2] - E[X]^2 is not the sample variance; it is indeed the population variance, which you cannot get from the random sample. If you replace the values in this equation by their sample estimate (as averages), you will get sample.V[X] = average[X^2] - average[X]^2, which is the sample variance, and is biased.
Some may say that I am picky on the semantics; however, the "abuse of notation" in the accepted answer is only acceptable when everybody recognizes it as such. However, for those trying to figure out these conceptual differences, I believe it is best to remain precise.