import numpy as np
import scipy.stats
def mean_confidence_interval(data, confidence=0.95):
a = 1.0 * np.array(data)
n = len(a)
m, se = np.mean(a), scipy.stats.sem(a)
h = se * scipy.stats.t.ppf((1 + confidence) / 2., n-1)
return m, m-h, m+h
you can calculate like this way.
Here a shortened version of shasan's code, calculating the 95% confidence interval of the mean of array a
:
import numpy as np, scipy.stats as st
st.t.interval(0.95, len(a)-1, loc=np.mean(a), scale=st.sem(a))
But using StatsModels' tconfint_mean
is arguably even nicer:
import statsmodels.stats.api as sms
sms.DescrStatsW(a).tconfint_mean()
The underlying assumptions for both are that the sample (array a
) was drawn independently from a normal distribution with unknown standard deviation (see MathWorld or Wikipedia).
For large sample size n, the sample mean is normally distributed, and one can calculate its confidence interval using st.norm.interval()
(as suggested in Jaime's comment). But the above solutions are correct also for small n, where st.norm.interval()
gives confidence intervals that are too narrow (i.e., "fake confidence"). See my answer to a similar question for more details (and one of Russ's comments here).
Here an example where the correct options give (essentially) identical confidence intervals:
In [9]: a = range(10,14)
In [10]: mean_confidence_interval(a)
Out[10]: (11.5, 9.4457397432391215, 13.554260256760879)
In [11]: st.t.interval(0.95, len(a)-1, loc=np.mean(a), scale=st.sem(a))
Out[11]: (9.4457397432391215, 13.554260256760879)
In [12]: sms.DescrStatsW(a).tconfint_mean()
Out[12]: (9.4457397432391197, 13.55426025676088)
And finally, the incorrect result using st.norm.interval()
:
In [13]: st.norm.interval(0.95, loc=np.mean(a), scale=st.sem(a))
Out[13]: (10.23484868811834, 12.76515131188166)
Starting Python 3.8
, the standard library provides the NormalDist
object as part of the statistics
module:
from statistics import NormalDist
def confidence_interval(data, confidence=0.95):
dist = NormalDist.from_samples(data)
z = NormalDist().inv_cdf((1 + confidence) / 2.)
h = dist.stdev * z / ((len(data) - 1) ** .5)
return dist.mean - h, dist.mean + h
This:
Creates a NormalDist
object from the data sample (NormalDist.from_samples(data)
, which gives us access to the sample's mean and standard deviation via NormalDist.mean
and NormalDist.stdev
.
Compute the Z-score
based on the standard normal distribution (represented by NormalDist()
) for the given confidence using the inverse of the cumulative distribution function (inv_cdf
).
Produces the confidence interval based on the sample's standard deviation and mean.
This assumes the sample size is big enough (let's say more than ~100 points) in order to use the standard normal distribution rather than the student's t distribution to compute the z
value.
Start with looking up the z-value for your desired confidence interval from a look-up table. The confidence interval is then mean +/- z*sigma
, where sigma
is the estimated standard deviation of your sample mean, given by sigma = s / sqrt(n)
, where s
is the standard deviation computed from your sample data and n
is your sample size.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With