Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there any python function/library for calculate binomial confidence intervals?

I need to calculate binomial confidence intervals for large set of data within a script of python. Do you know any function or library of python that can do this?

Ideally I would like to have a function like this http://statpages.org/confint.html implemented on python.

Thanks for your time.

like image 631
Geparada Avatar asked Oct 24 '12 22:10

Geparada


People also ask

Which Python function will give the 95% confidence interval?

ppf function. The arguments for t. ppf() are q = percentage, df = degree of freedom, scale = std dev, loc = mean. As t-distribution is symmetric for a 95% confidence interval q will be 0.975.

Which Python module creates confidence intervals?

interval() function from the scipy. stats library to calculate a confidence interval for a population mean.

How do you find the binomial confidence interval?

Divide the numbers you found in the table by the number of population members. In this example, there are 10,000 members, so the confidence interval is: 2.202 / 10,000 = 0.00022. 13.06 / 10,000 = 0.001306.


8 Answers

Just noting because it hasn't been posted elsewhere here that statsmodels.stats.proportion.proportion_confint lets you get a binomial confidence interval with a variety of methods. It only does symmetric intervals, though.

like image 106
Danica Avatar answered Oct 01 '22 05:10

Danica


I would say that R (or another stats package) would probably serve you better if you have the option. That said, if you only need the binomial confidence interval you probably don't need an entire library. Here's the function in my most naive translation from javascript.

def binP(N, p, x1, x2):
    p = float(p)
    q = p/(1-p)
    k = 0.0
    v = 1.0
    s = 0.0
    tot = 0.0

    while(k<=N):
            tot += v
            if(k >= x1 and k <= x2):
                    s += v
            if(tot > 10**30):
                    s = s/10**30
                    tot = tot/10**30
                    v = v/10**30
            k += 1
            v = v*q*(N+1-k)/k
    return s/tot

def calcBin(vx, vN, vCL = 95):
    '''
    Calculate the exact confidence interval for a binomial proportion

    Usage:
    >>> calcBin(13,100)    
    (0.07107391357421874, 0.21204372406005856)
    >>> calcBin(4,7)   
    (0.18405151367187494, 0.9010086059570312)
    ''' 
    vx = float(vx)
    vN = float(vN)
    #Set the confidence bounds
    vTU = (100 - float(vCL))/2
    vTL = vTU

    vP = vx/vN
    if(vx==0):
            dl = 0.0
    else:
            v = vP/2
            vsL = 0
            vsH = vP
            p = vTL/100

            while((vsH-vsL) > 10**-5):
                    if(binP(vN, v, vx, vN) > p):
                            vsH = v
                            v = (vsL+v)/2
                    else:
                            vsL = v
                            v = (v+vsH)/2
            dl = v

    if(vx==vN):
            ul = 1.0
    else:
            v = (1+vP)/2
            vsL =vP
            vsH = 1
            p = vTU/100
            while((vsH-vsL) > 10**-5):
                    if(binP(vN, v, 0, vx) < p):
                            vsH = v
                            v = (vsL+v)/2
                    else:
                            vsL = v
                            v = (v+vsH)/2
            ul = v
    return (dl, ul)
like image 41
Curt Avatar answered Oct 02 '22 05:10

Curt


While the scipy.stats module has a method .interval() to compute the equal tails confidence, it lacks a similar method to compute the highest density interval. Here is a rough way to do it using methods found in scipy and numpy.

This solution also assumes you want to use a Beta distribution as a prior. The hyper-parameters a and b are set to 1, so that the default prior is a uniform distribution between 0 and 1.

import numpy
from scipy.stats import beta
from scipy.stats import norm

def binomial_hpdr(n, N, pct, a=1, b=1, n_pbins=1e3):
    """
    Function computes the posterior mode along with the upper and lower bounds of the
    **Highest Posterior Density Region**.

    Parameters
    ----------
    n: number of successes 
    N: sample size 
    pct: the size of the confidence interval (between 0 and 1)
    a: the alpha hyper-parameter for the Beta distribution used as a prior (Default=1)
    b: the beta hyper-parameter for the Beta distribution used as a prior (Default=1)
    n_pbins: the number of bins to segment the p_range into (Default=1e3)

    Returns
    -------
    A tuple that contains the mode as well as the lower and upper bounds of the interval
    (mode, lower, upper)

    """
    # fixed random variable object for posterior Beta distribution
    rv = beta(n+a, N-n+b)
    # determine the mode and standard deviation of the posterior
    stdev = rv.stats('v')**0.5
    mode = (n+a-1.)/(N+a+b-2.)
    # compute the number of sigma that corresponds to this confidence
    # this is used to set the rough range of possible success probabilities
    n_sigma = numpy.ceil(norm.ppf( (1+pct)/2. ))+1
    # set the min and max values for success probability 
    max_p = mode + n_sigma * stdev
    if max_p > 1:
        max_p = 1.
    min_p = mode - n_sigma * stdev
    if min_p > 1:
        min_p = 1.
    # make the range of success probabilities
    p_range = numpy.linspace(min_p, max_p, n_pbins+1)
    # construct the probability mass function over the given range
    if mode > 0.5:
        sf = rv.sf(p_range)
        pmf = sf[:-1] - sf[1:]
    else:
        cdf = rv.cdf(p_range)
        pmf = cdf[1:] - cdf[:-1]
    # find the upper and lower bounds of the interval 
    sorted_idxs = numpy.argsort( pmf )[::-1]
    cumsum = numpy.cumsum( numpy.sort(pmf)[::-1] )
    j = numpy.argmin( numpy.abs(cumsum - pct) )
    upper = p_range[ (sorted_idxs[:j+1]).max()+1 ]
    lower = p_range[ (sorted_idxs[:j+1]).min() ]    

    return (mode, lower, upper)
like image 22
mtw729 Avatar answered Sep 30 '22 05:09

mtw729


Just been trying this myself. If it helps here's my solution, which takes two lines of code and seems to give equivalent results to that JS page. This is the frequentist one-sided interval, I'm calling the input argument the MLE (maximum likelihood estimate) of the binomial parameter theta. I.e. mle = number of successes/number of trials. I find the upper bound of the one sided interval. The alpha value used here is therefore double the one in the JS page for the upper limit.

from scipy.stats import binom
from scipy.optimize import bisect

def binomial_ci( mle, N, alpha=0.05 ):
    """
    One sided confidence interval for a binomial test.

    If after N trials we obtain mle as the proportion of those
    trials that resulted in success, find c such that

    P(k/N < mle; theta = c) = alpha

    where k/N is the proportion of successes in the set of trials,
    and theta is the success probability for each trial. 
    """


    to_minimise = lambda c: binom.cdf(mle*N,N,c)-alpha
    return bisect(to_minimise,0,1)

To find the two sided interval, call with (1-alpha/2) and alpha/2 as arguments.

like image 28
James Thorniley Avatar answered Sep 28 '22 05:09

James Thorniley


I needed to do this as well. I was using R and wanted to learn a way to work it out for myself. I would not say it is strictly pythonic.

The docstring explains most of it. It assumes you have scipy installed.

def exact_CI(x, N, alpha=0.95):
    """
    Calculate the exact confidence interval of a proportion 
    where there is a wide range in the sample size or the proportion.

    This method avoids the assumption that data are normally distributed. The sample size
    and proportion are desctibed by a beta distribution.

    Parameters
    ----------

    x: the number of cases from which the proportion is calulated as a positive integer.

    N: the sample size as a positive integer.

    alpha : set at 0.95 for 95% confidence intervals.

    Returns
    -------
    The proportion with the lower and upper confidence intervals as a dict.

    """
    from scipy.stats import beta
    x = float(x)
    N = float(N)
    p = round((x/N)*100,2)

    intervals = [round(i,4)*100 for i in beta.interval(alpha,x,N-x+1)]
    intervals.insert(0,p)

    result = {'Proportion': intervals[0], 'Lower CI': intervals[1], 'Upper CI': intervals[2]}

    return result
like image 29
John Avatar answered Sep 29 '22 05:09

John


A numpy/scipy-free way of computing the same thing using the Wilson score and an approximation to the normal cumulative density function,

import math

def binconf(p, n, c=0.95):
  '''
  Calculate binomial confidence interval based on the number of positive and
  negative events observed.

  Parameters
  ----------
  p: int
      number of positive events observed
  n: int
      number of negative events observed
  c : optional, [0,1]
      confidence percentage. e.g. 0.95 means 95% confident the probability of
      success lies between the 2 returned values

  Returns
  -------
  theta_low  : float
      lower bound on confidence interval
  theta_high : float
      upper bound on confidence interval
  '''
  p, n = float(p), float(n)
  N    = p + n

  if N == 0.0: return (0.0, 1.0)

  p = p / N
  z = normcdfi(1 - 0.5 * (1-c))

  a1 = 1.0 / (1.0 + z * z / N)
  a2 = p + z * z / (2 * N)
  a3 = z * math.sqrt(p * (1-p) / N + z * z / (4 * N * N))

  return (a1 * (a2 - a3), a1 * (a2 + a3))


def erfi(x):
  """Approximation to inverse error function"""
  a  = 0.147  # MAGIC!!!
  a1 = math.log(1 - x * x)
  a2 = (
    2.0 / (math.pi * a)
    + a1 / 2.0
  )

  return (
    sign(x) *
    math.sqrt( math.sqrt(a2 * a2 - a1 / a) - a2 )
  )


def sign(x):
  if x  < 0: return -1
  if x == 0: return  0
  if x  > 0: return  1


def normcdfi(p, mu=0.0, sigma2=1.0):
  """Inverse CDF of normal distribution"""
  if mu == 0.0 and sigma2 == 1.0:
    return math.sqrt(2) * erfi(2 * p - 1)
  else:
    return mu + math.sqrt(sigma2) * normcdfi(p)
like image 32
duckworthd Avatar answered Oct 02 '22 05:10

duckworthd


The following gives exact (Clopper-Pearson) interval for binomial distribution in a simple way.

def binomial_ci(x, n, alpha=0.05):
    #x is number of successes, n is number of trials
    from scipy import stats
    if x==0:
        c1 = 0
    else:
        c1 = stats.beta.interval(1-alpha, x,n-x+1)[0]
    if x==n:
        c2=1
    else:
        c2 = stats.beta.interval(1-alpha, x+1,n-x)[1]
    return c1, c2

You may check the code by e.g.:

p1,p2 = binomial_ci(2,7)
from scipy import stats
assert abs(stats.binom.cdf(1,7,p1)-.975)<1E-5
assert abs(stats.binom.cdf(2,7,p2)-.025)<1E-5
assert abs(binomial_ci(0,7, alpha=.1)[0])<1E-5
assert abs((1-binomial_ci(0,7, alpha=.1)[1])**7-0.05)<1E-5
assert abs(binomial_ci(7,7, alpha=.1)[1]-1)<1E-5
assert abs((binomial_ci(7,7, alpha=.1)[0])**7-0.05)<1E-5

I used the relation between the binomial proportion confidence interval and the regularized incomplete beta function, as described here: https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Clopper%E2%80%93Pearson_interval

like image 35
Dani Avatar answered Sep 30 '22 05:09

Dani


Astropy provides such a function (although installing and importing astropy may be a bit excessive): astropy.stats.binom_conf_interval

like image 31
TheBamf Avatar answered Sep 28 '22 05:09

TheBamf