Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to properly fit a beta distribution in python?

I am trying to get a correct way of fitting a beta distribution. It's not a real world problem i am just testing the effects of a few different methods, and in doing this something is puzzling me.

Here is the python code I am working on, in which I tested 3 different approaches: 1>: fit using moments (sample mean and variance). 2>: fit by minimizing the negative log-likelihood (by using scipy.optimize.fmin()). 3>: simply call scipy.stats.beta.fit()

from scipy.optimize import fmin
from scipy.stats import beta
from scipy.special import gamma as gammaf
import matplotlib.pyplot as plt
import numpy


def betaNLL(param,*args):
    '''Negative log likelihood function for beta
    <param>: list for parameters to be fitted.
    <args>: 1-element array containing the sample data.

    Return <nll>: negative log-likelihood to be minimized.
    '''

    a,b=param
    data=args[0]
    pdf=beta.pdf(data,a,b,loc=0,scale=1)
    lg=numpy.log(pdf)
    #-----Replace -inf with 0s------
    lg=numpy.where(lg==-numpy.inf,0,lg)
    nll=-1*numpy.sum(lg)
    return nll

#-------------------Sample data-------------------
data=beta.rvs(5,2,loc=0,scale=1,size=500)

#----------------Normalize to [0,1]----------------
#data=(data-numpy.min(data))/(numpy.max(data)-numpy.min(data))

#----------------Fit using moments----------------
mean=numpy.mean(data)
var=numpy.var(data,ddof=1)
alpha1=mean**2*(1-mean)/var-mean
beta1=alpha1*(1-mean)/mean

#------------------Fit using mle------------------
result=fmin(betaNLL,[1,1],args=(data,))
alpha2,beta2=result

#----------------Fit using beta.fit----------------
alpha3,beta3,xx,yy=beta.fit(data)

print '\n# alpha,beta from moments:',alpha1,beta1
print '# alpha,beta from mle:',alpha2,beta2
print '# alpha,beta from beta.fit:',alpha3,beta3

#-----------------------Plot-----------------------
plt.hist(data,bins=30,normed=True)
fitted=lambda x,a,b:gammaf(a+b)/gammaf(a)/gammaf(b)*x**(a-1)*(1-x)**(b-1) #pdf of beta

xx=numpy.linspace(0,max(data),len(data))
plt.plot(xx,fitted(xx,alpha1,beta1),'g')
plt.plot(xx,fitted(xx,alpha2,beta2),'b')
plt.plot(xx,fitted(xx,alpha3,beta3),'r')

plt.show()

The problem I have is about the normalization process (z=(x-a)/(b-a)) where a and b are the min and max of the sample, respectively.

When I don't do the normalization, everything works Ok, there are slight differences among different fitting methods, by reasonably good.

But when I did the normalization, here is the result plot I got.

Plot

Only the moment method (green line) looks Ok.

The scipy.stats.beta.fit() method (red line) is uniform always, no matter what parameters I use to generate the random numbers.

And the MLE (blue line) fails.

So it seems like the normalization is creating these issues. But I think it is legal to have x=0 and x=1 in the beta distribution. And if given a real world problem, isn't it the 1st step to normalize the sample observations to make it in between [0,1] ? In that case, how should I fit the curve?

like image 973
Jason Avatar asked Apr 27 '14 21:04

Jason


People also ask

Is beta distribution bounded?

The beta distribution is a bounded continuous distribution. If is often used to express an uncertainty in a proportion, frequency, or percentage, which are all quantities between 0 and 1.

How do you test a beta distribution?

The Formula for the Beta Distribution 'A' and 'b' are used for representing lower and the upper bounds respectively for the distribution. B(p, q) is the beta function. The beta function has this formula: B(α,β)=∫01t(α−1)(1−t)(β−1)dt.

What are the parameters of beta distribution?

The beta distribution is a family of continuous probability distributions set on the interval [0, 1] having two positive shape parameters, expressed by α and β. These two parameters appear as exponents of the random variable and manage the shape of the distribution.

What is beta distribution example?

A Beta distribution is a versatile way to represent outcomes for percentages or proportions. For example, how likely is it that a rogue candidate will win the next Presidential election? You might think the probability is 0.2. Your friend might think it's 0.15.


3 Answers

The problem is that beta.pdf() sometimes returns 0 and inf for 0 and 1. For example:

>>> from scipy.stats import beta
>>> beta.pdf(1,1.05,0.95)
/usr/lib64/python2.6/site-packages/scipy/stats/distributions.py:1165: RuntimeWarning: divide by zero encountered in power
  Px = (1.0-x)**(b-1.0) * x**(a-1.0)
inf
>>> beta.pdf(0,1.05,0.95)
0.0

You're guaranteeing that you will have one data sample at 0 and 1 by your normalization process. Although you "correct" for values at which the pdf is 0, you are not correcting for those which return inf. To account for this you can just remove all the values which are not finite:

def betaNLL(param,*args):
    """
    Negative log likelihood function for beta
    <param>: list for parameters to be fitted.
    <args>: 1-element array containing the sample data.

    Return <nll>: negative log-likelihood to be minimized.
    """

    a, b = param
    data = args[0]
    pdf = beta.pdf(data,a,b,loc=0,scale=1)
    lg = np.log(pdf)
    mask = np.isfinite(lg)
    nll = -lg[mask].sum()
    return nll

beta fit

Really you shouldn't be normalizing like this though, because you are essentially throwing two data points out of the fit.

like image 158
user545424 Avatar answered Oct 16 '22 23:10

user545424


Without a docstring for beta.fit, it was a little tricky to find, but if you know the upper and lower limits you want to force upon beta.fit, you can use the kwargs floc and fscale.

I ran your code only using the beta.fit method, but with and without the floc and fscale kwargs. Also, I checked it with the arguments as ints and floats to make sure that wouldn't affect your answer. It didn't (on this test. I can't say if it never would.)

>>> from scipy.stats import beta
>>> import numpy
>>> def betaNLL(param,*args):
    '''Negative log likelihood function for beta
    <param>: list for parameters to be fitted.
    <args>: 1-element array containing the sample data.

    Return <nll>: negative log-likelihood to be minimized.
    '''

    a,b=param
    data=args[0]
    pdf=beta.pdf(data,a,b,loc=0,scale=1)
    lg=numpy.log(pdf)
    #-----Replace -inf with 0s------
    lg=numpy.where(lg==-numpy.inf,0,lg)
    nll=-1*numpy.sum(lg)
    return nll

>>> data=beta.rvs(5,2,loc=0,scale=1,size=500)
>>> beta.fit(data)
(5.696963536654355, 2.0005252702837009, -0.060443307228404922, 1.0580278414086459)
>>> beta.fit(data,floc=0,fscale=1)
(5.0952451826831462, 1.9546341057106007, 0, 1)
>>> beta.fit(data,floc=0.,fscale=1.)
(5.0952451826831462, 1.9546341057106007, 0.0, 1.0)

In conclusion, it seems this doesn't change your data (through normalization) or throw out data. I just think it should be noted that care should be taken when using this. In your case, you knew the limits were 0 and 1 because you got data out of a defined distribution that was between 0 and 1. In other cases, limits might be known, but if they are not known, beta.fit will provide them. In this case, without specifying the limits of 0 and 1, beta.fit calculated them to be loc=-0.06 and scale=1.058.

like image 22
jdj081 Avatar answered Oct 17 '22 00:10

jdj081


I used the method proposed in doi:10.1080/00949657808810232 to fir the beta parameters:

from scipy.special import psi
from scipy.special import polygamma
from scipy.optimize import root_scalar
from numpy.random import beta
import numpy as np

def ipsi(y):
    if y >= -2.22:
        x = np.exp(y) + 0.5
    else:
        x = - 1/ (y + psi(1))  
    for i in range(5):
        x = x - (psi(x) - y)/(polygamma(1,x))
    return x
        
#%%
# q satisface
# psi(q) - psi(ipsi(lng1 - lng2 + psi(q)) + q) -lng2 = 0 
# O sea, busco raíz de 
# f(q) = psi(q) - psi(ipsi(lng1 - lng2 + psi(q)) + q) -lng2
# luego:
# p = ipsi(lng1 - lng2 + psi(q))
def f(q,lng1,lng2):
    return psi(q) - psi(ipsi(lng1 - lng2 + psi(q)) + q) -lng2

#%%
def ml_beta_pq(sample):
    lng1 = np.log(sample).mean()
    lng2 = np.log(1-sample).mean()
    def g(q):
        return f(q,lng1,lng2)
    q=root_scalar(g,x0=1,x1=1.1).root
    p = ipsi(lng1 - lng2 + psi(q))
    return p, q

#%%
p = 2
q = 5
n = 1500
sample = beta(p,q,n)
ps,qs = ml_beta_pq(sample) #s de sombrero

print(f'Estimación de parámetros de una beta({p}, {q}) \na partir de una muestra de tamaño n = {n}')
print(f'\nn ={n:5d} |   p   |   q')
print(f'---------+-------+------')
print(f'original | {p:2.3f} | {q:2.3f}')
print(f'estimado | {ps:2.3f} | {qs:2.3f}')
like image 41
rgrimson Avatar answered Oct 17 '22 01:10

rgrimson