I have a 1-dimensional array of data:
a = np.array([1,2,3,4,4,4,5,5,5,5,4,4,4,6,7,8])
for which I want to obtain the 68% confidence interval (i.e., the 1 sigma interval).
The first comment in this answer states that this can be achieved using scipy.stats.norm.interval, via:

from scipy import stats
import numpy as np

mean, sigma = np.mean(a), np.std(a)
conf_int = stats.norm.interval(0.68, loc=mean, scale=sigma)
But a comment on this post states that the correct way to obtain the confidence interval is:
conf_int = stats.norm.interval(0.68, loc=mean, scale=sigma / np.sqrt(len(a)))
that is, sigma is divided by the square root of the sample size: np.sqrt(len(a)).
The question is: which version is the correct one?
“When reporting confidence intervals, use the format 95% CI [LL, UL], where LL is the lower limit of the confidence interval and UL is the upper limit.” For example, one might report: 95% CI [5.62, 8.31].
1. Create a new sample based on our dataset, with replacement and with the same number of points.
2. Calculate the mean value and store it in an array or list.
3. Repeat the process many times (e.g. 1000).
4. On the list of mean values, calculate the 2.5th percentile and the 97.5th percentile (if you want a 95% confidence interval).
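The steps above can be sketched as a small bootstrap loop; here applied to the array from the question (the iteration count and seed are illustrative choices, not from the original answer):

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([1, 2, 3, 4, 4, 4, 5, 5, 5, 5, 4, 4, 4, 6, 7, 8])

# Resample with replacement, same size as the data, many times,
# keeping each resample's mean
boot_means = [rng.choice(a, size=len(a), replace=True).mean()
              for _ in range(1000)]

# The 2.5th and 97.5th percentiles give a 95% bootstrap CI for the mean
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print('95% bootstrap CI for the mean: [{:.2f}, {:.2f}]'.format(lower, upper))
```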
For a 95% confidence interval, we use z = 1.96, while for a 90% confidence interval, for example, we use z = 1.645.
The 68% confidence interval for a single draw from a normal distribution with mean mu and std deviation sigma is
stats.norm.interval(0.68, loc=mu, scale=sigma)
The 68% confidence interval for the mean of N draws from a normal distribution with mean mu and std deviation sigma is
stats.norm.interval(0.68, loc=mu, scale=sigma/sqrt(N))
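Applied to the array from the question (using the sample estimates in place of the true mu and sigma), the two calls give a wide interval for a single draw and a much narrower one for the mean:

```python
import numpy as np
from scipy import stats

a = np.array([1, 2, 3, 4, 4, 4, 5, 5, 5, 5, 4, 4, 4, 6, 7, 8])
mu, sigma = a.mean(), a.std(ddof=1)

# Where a single draw is expected to land 68% of the time
single = stats.norm.interval(0.68, loc=mu, scale=sigma)
# Where the mean of len(a) draws is expected to land 68% of the time
of_mean = stats.norm.interval(0.68, loc=mu, scale=sigma / np.sqrt(len(a)))

print('single draw:    ', single)
print('mean of N draws:', of_mean)
```

With len(a) = 16 the second interval is exactly sqrt(16) = 4 times narrower than the first.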
Intuitively, these formulas make sense: if you hold up a jar of jelly beans and ask a large number of people to guess how many it contains, each individual guess may be off by a lot -- the same standard deviation sigma -- but the average of the guesses will do a remarkably fine job of estimating the actual number, which is reflected by the standard deviation of the mean shrinking by a factor of 1/sqrt(N).
If a single draw has variance sigma**2, then by the Bienaymé formula, the sum of N uncorrelated draws has variance N*sigma**2.
The mean is equal to the sum divided by N. When you multiply a random variable (like the sum) by a constant, the variance is multiplied by the constant squared. That is
Var(cX) = c**2 * Var(X)
So the variance of the mean equals

(variance of the sum)/N**2 = N * sigma**2 / N**2 = sigma**2 / N

and so the standard deviation of the mean (which is the square root of the variance) equals sigma/sqrt(N). This is the origin of the sqrt(N) in the denominator.
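A quick simulation confirms the sigma**2 / N result (my sketch; the sample sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
sigma, N, trials = 2.0, 25, 100_000

# Draw `trials` samples of N uncorrelated normals and take each sample's mean
means = rng.normal(0, sigma, size=(trials, N)).mean(axis=1)

# The empirical variance of the mean should be close to sigma**2 / N
print('empirical:', means.var())
print('theory:   ', sigma**2 / N)  # 4 / 25 = 0.16
```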
Here is some example code, based on Tom's code, which demonstrates the claims made above:
import numpy as np
from scipy import stats

N = 10000
a = np.random.normal(0, 1, N)
mean, sigma = a.mean(), a.std(ddof=1)
conf_int_a = stats.norm.interval(0.68, loc=mean, scale=sigma)

print('{:0.2%} of the single draws are in conf_int_a'
      .format(((a >= conf_int_a[0]) & (a < conf_int_a[1])).sum() / float(N)))

M = 1000
b = np.random.normal(0, 1, (N, M)).mean(axis=1)
conf_int_b = stats.norm.interval(0.68, loc=0, scale=1 / np.sqrt(M))

print('{:0.2%} of the means are in conf_int_b'
      .format(((b >= conf_int_b[0]) & (b < conf_int_b[1])).sum() / float(N)))
prints
68.03% of the single draws are in conf_int_a
67.78% of the means are in conf_int_b
Beware that if you define conf_int_b with the estimates for mean and sigma based on the sample a, the mean may not fall in conf_int_b with the desired frequency.
If you take a sample from a distribution and compute the sample mean and standard deviation,

mean, sigma = a.mean(), a.std()

be careful to note that there is no guarantee that these will equal the population mean and standard deviation, and that we are assuming the population is normally distributed -- those are not automatic givens!
If you take a sample and want to estimate the population mean and standard deviation, you should use

mean, sigma = a.mean(), a.std(ddof=1)

since ddof=1 applies Bessel's correction (dividing by N - 1 instead of N), which makes the variance estimate unbiased; the resulting sigma is the usual sample standard deviation.
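The ddof difference is easy to see on a small sample (the numbers below are illustrative):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])

# ddof=0 divides the sum of squared deviations by N (population formula);
# ddof=1 divides by N - 1 (Bessel's correction)
print(a.std())        # sqrt(5/4) ~ 1.118
print(a.std(ddof=1))  # sqrt(5/3) ~ 1.291
```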