
How to interpret the upper/lower bound of a datapoint with confidence intervals?

Given a list of values:

>>> from scipy import stats
>>> import numpy as np
>>> x = list(range(100))

Using Student's t-test, I can find the confidence interval of the distribution at the mean with an alpha of 0.1 (i.e. at 90% confidence) with:

def confidence_interval(alist, v, itv):
    return stats.t.interval(itv, df=len(alist)-1, loc=v, scale=stats.sem(alist))

x = list(range(100))
confidence_interval(x, np.mean(x), 0.1)

[out]:

(49.134501289005009, 49.865498710994991)

But if I were to find the confidence interval at every datapoint, e.g. for the value 10:

>>> confidence_interval(x, 10, 0.1)
(9.6345012890050086, 10.365498710994991)

How should the interval of the values be interpreted? Is it statistically/mathematically sound to interpret it at all?

Does it go something like:

At 90% confidence, we know that the data point 10 falls in the interval (9.6345012890050086, 10.365498710994991),

aka.

At 90% confidence, we can say that the data point falls at 10 +- 0.365...

So can we interpret the interval as some sort of a box plot of the datapoint?

alvas asked Mar 15 '17 00:03


2 Answers

In short

Your call gives the confidence interval for the mean of a normal distribution with unknown parameters, from which you drew 100 observations with an average of 10 and a standard deviation of 29. It is furthermore not sound to interpret it, since your distribution is clearly not normal, and because 10 is not the observed mean.

TL;DR

There are a lot of misconceptions floating around confidence intervals, most of which seem to stem from a misunderstanding of what we are confident about. Since there is some confusion in your understanding of confidence intervals, a broader explanation may give you a deeper understanding of the concepts you are handling, and hopefully rule out any source of error.

Clearing out misconceptions

Very briefly, to set things up: we are in a situation where we want to estimate a parameter, or rather, we want to test a hypothesis about the value of a parameter of the distribution of a random variable. For example, let's say I have a normally distributed variable X with mean m and standard deviation sigma, and I want to test the hypothesis m=0.

What is a parametric test

This is a process for testing a hypothesis about a parameter of a random variable. Since we only have access to observations, which are concrete realizations of the random variable, it generally proceeds by computing a statistic of these realizations. A statistic is roughly a function of the realizations of a random variable. Let's call this function S; we can compute S on x_1,...,x_n, which are n realizations of X.

You can therefore see that S(X) is a random variable as well, with its own distribution, parameters and so on! The idea is that for standard tests, S(X) follows a very well-known distribution for which values are tabulated, e.g. http://www.sjsu.edu/faculty/gerstman/StatPrimer/t-table.pdf

What is a confidence interval?

Given what we've just said, a definition for a confidence interval would be: the range of values for the tested parameter such that, if the observations had been generated from a distribution parametrized by a value in that range, they would not be improbable. In other words, a confidence interval gives an answer to the question: given the observations x_1,...,x_n, n realizations of X, can we confidently say that X's distribution is parametrized by such a value? The 90%, 95%, etc. states the level of confidence. Usually, external constraints fix this level (industrial norms for quality assessment, scientific norms, e.g. for the discovery of new particles).

I think it is now intuitive to you that:

  1. The higher the confidence level, the larger the confidence interval. e.g. for a confidence of 100% the confidence interval would range across all the possible values as soon as there is some uncertainty

  2. For most tests, under conditions I won't describe, the more observations we have, the narrower the confidence interval becomes.
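
Both points can be checked numerically. The snippet below is a minimal sketch on simulated normal data; the helper names `mean_ci` and `width`, the seed, and the sample sizes are illustrative choices, not part of the question.

```python
import numpy as np
from scipy import stats

def mean_ci(sample, confidence):
    """Two-sided t-based confidence interval for the mean of `sample`."""
    sample = np.asarray(sample, dtype=float)
    return stats.t.interval(confidence, df=len(sample) - 1,
                            loc=sample.mean(), scale=stats.sem(sample))

def width(interval):
    """Length of an interval given as a (low, high) pair."""
    low, high = interval
    return high - low

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
small = rng.normal(loc=5.0, scale=2.0, size=30)
large = rng.normal(loc=5.0, scale=2.0, size=3000)

# 1. Same sample, higher confidence level: the interval gets wider.
w90 = width(mean_ci(small, 0.90))
w99 = width(mean_ci(small, 0.99))

# 2. Same confidence level, many more observations: the interval gets narrower.
w90_large = width(mean_ci(large, 0.90))

print(f"n=30,   90%: width {w90:.3f}")
print(f"n=30,   99%: width {w99:.3f}")
print(f"n=3000, 90%: width {w90_large:.3f}")
```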

At 90% confidence, we know that the data point 10 falls in the interval (9.6345012890050086, 10.365498710994991)

It is wrong to say that, and it is the most common source of mistakes. A 90% confidence interval never means that the estimated parameter has a 90% chance of falling into that interval. Once the interval is computed, it either covers the parameter or it does not; it is no longer a matter of probability. The 90% is an assessment of the reliability of the estimation procedure.

What is a student test?

Now let's come to your example and look at it in the light of what we've just said. You want to apply a Student test to your list of observations. First: a Student test aims at testing a hypothesis of equality between the mean m of a normally distributed random variable with unknown standard deviation, and a certain value m_0.

The statistic associated with this test is t = (np.mean(x) - m_0)/(s/sqrt(n)) where x is your vector of observations, n the number of observations and s the empirical standard deviation. Unsurprisingly, if the hypothesis holds, this follows a Student distribution with n-1 degrees of freedom.

Hence, what you want to do is:

  1. compute this statistic for your sample, and compute the interval associated with a Student distribution with this many degrees of freedom, this theoretical mean, and this confidence level

  2. see if your computed t falls into that interval, which tells you whether you can rule out the equality hypothesis at that level of confidence.
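
As a rough sketch of these two steps, the snippet below runs them on a simulated, approximately normal sample. The sample, the seed, and the values of `m_0` and `confidence` are assumptions made for illustration; the final lines cross-check the result against `scipy.stats.ttest_1samp`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)               # arbitrary seed
x = rng.normal(loc=0.3, scale=1.0, size=50)  # simulated sample (illustrative)
m_0 = 0.0                                    # hypothesised mean
confidence = 0.90

# Step 1: the t statistic and the interval of a standard Student
# distribution with n - 1 degrees of freedom.
n = len(x)
s = np.std(x, ddof=1)                        # empirical standard deviation
t_stat = (np.mean(x) - m_0) / (s / np.sqrt(n))
low, high = stats.t.interval(confidence, df=n - 1)

# Step 2: the hypothesis m = m_0 is rejected when t falls outside the interval.
reject = not (low <= t_stat <= high)

# Cross-check with SciPy's one-sample t-test (two-sided by default).
res = stats.ttest_1samp(x, popmean=m_0)
print(t_stat, (low, high), reject)
print(res.statistic, res.pvalue)
```

Rejecting when t lies outside the interval is equivalent to rejecting when the two-sided p-value is below 1 - confidence.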

I wanted to give you an exercise but I think I've been lengthy enough.

To conclude on the use of scipy.stats.t.interval: you can use it in one of two ways. Either compute the t statistic yourself with the formula shown above and check whether t falls within the interval returned by interval(alpha, df), where df is the number of observations minus one. Or call interval(alpha, df, loc=m, scale=s) directly, where m is your empirical mean and s the empirical standard deviation divided by sqrt(n). In that case, the returned interval is directly the confidence interval for the mean.
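
The two usages can be compared on the data from the question. Note that the first positional argument of `stats.t.interval` is the confidence level itself, so passing 0.1 as in the question produces a 10% interval (which is why the reported bounds are so tight), while a 90% interval needs 0.9. The variable names below are illustrative.

```python
import numpy as np
from scipy import stats

x = np.arange(100, dtype=float)   # the data from the question
n = len(x)
m = x.mean()                      # empirical mean, 49.5
se = stats.sem(x)                 # empirical standard deviation / sqrt(n)

# Direct usage: interval for the mean, centred on the sample mean.
ci_10 = stats.t.interval(0.1, df=n - 1, loc=m, scale=se)  # 10% confidence level
ci_90 = stats.t.interval(0.9, df=n - 1, loc=m, scale=se)  # 90% confidence level

# Manual usage: take the interval of the standard Student distribution
# and rescale it by the standard error around the sample mean.
t_low, t_high = stats.t.interval(0.9, df=n - 1)
manual_90 = (m + t_low * se, m + t_high * se)

print("10%:", ci_10)   # matches the interval printed in the question
print("90%:", ci_90)
print("90% (manual):", manual_90)
```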

So in your case, your call gives the confidence interval for the mean of a normal distribution with unknown parameters, from which you drew 100 observations with an average of 10 and a standard deviation of 29. It is furthermore not sound to interpret it, besides the error of interpretation I've already pointed out, since your distribution is clearly not normal, and because 10 is not the observed mean.

Resources

You can check out the following resources to go further.

Wikipedia links for quick reference and a more elaborate overview

https://en.wikipedia.org/wiki/Confidence_interval

https://en.wikipedia.org/wiki/Student%27s_t-test

https://en.wikipedia.org/wiki/Student%27s_t-distribution

To go further

http://osp.mans.edu.eg/tmahdy/papers_of_month/0706_statistical.pdf

I haven't read it but the one below seems quite good. https://web.williams.edu/Mathematics/sjmiller/public_html/BrownClasses/162/Handouts/StatsTests04.pdf

You should also check out p-values; you will find a lot of similarities, and hopefully you will understand them better after reading this post.

https://en.wikipedia.org/wiki/P-value#Definition_and_interpretation

Anis answered Oct 09 '22 04:10


Confidence intervals are hopelessly counter-intuitive. Especially for programmers, I dare say as a programmer.

Wikipedia uses a 90% confidence level to illustrate a possible interpretation:

Were this procedure to be repeated on numerous samples, the fraction of calculated confidence intervals (which would differ for each sample) that encompass the true population parameter would tend toward 90%.

In other words

  1. The confidence interval provides information about a statistical parameter (such as the mean) of the population the sample was drawn from.
  2. The interpretation of e.g. a 90% confidence interval would be: If you repeat the experiment an infinite number of times 90% of the resulting confidence intervals will contain the true parameter.
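
This repeated-experiment reading can be approximated by simulation. The sketch below draws many samples from a normal distribution whose mean is known, builds a 90% t-interval from each, and counts how often the true mean is covered; the population parameters, sample size, number of repetitions and seed are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)          # arbitrary seed
true_mean, true_sd = 10.0, 3.0           # known only because we simulate
n, repeats, confidence = 25, 5000, 0.90

# One row per repeated experiment.
samples = rng.normal(true_mean, true_sd, size=(repeats, n))
means = samples.mean(axis=1)
sems = samples.std(axis=1, ddof=1) / np.sqrt(n)

# One interval per experiment (loc and scale broadcast over the rows).
low, high = stats.t.interval(confidence, df=n - 1, loc=means, scale=sems)

# Fraction of intervals that contain the true mean.
coverage = np.mean((low <= true_mean) & (true_mean <= high))
print(f"empirical coverage: {coverage:.3f}")
```

Each individual interval either contains the true mean or it does not; it is the long-run fraction that tends toward the confidence level.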

Assuming the code to compute the interval is correct (which I have not checked) you can use it to calculate the confidence interval of the mean (because of the t-distribution, which models the sample mean of a normally distributed population with unknown standard deviation).

For practical purposes it makes sense to pass in the sample mean. Otherwise you are saying "if I pretended my data had a sample mean of e.g. 10, the confidence interval of the mean would be [9.6, 10.4]".

The particular data passed into the confidence interval does not make sense either. Numbers increasing in a range from 0 to 99 are very unlikely to be drawn from a normal distribution.
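
One rough way to see this is to look at the shape of the data through its moments. This is only an informal check, not a formal normality test: a normal distribution has an excess kurtosis of 0, while evenly spaced values behave like a uniform distribution, whose excess kurtosis is about -1.2.

```python
import numpy as np
from scipy import stats

x = np.arange(100)                    # the data from the question

skewness = stats.skew(x)              # 0 for any symmetric sample
excess_kurtosis = stats.kurtosis(x)   # Fisher definition: 0 for a normal law

print(f"skewness:        {skewness:.4f}")
print(f"excess kurtosis: {excess_kurtosis:.4f}")
```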

MB-F answered Oct 09 '22 05:10