How to properly fit a beta distribution in python?

Tags:

I am trying to get a correct way of fitting a beta distribution. It's not a real world problem i am just testing the effects of a few different methods, and in doing this something is puzzling me.

Here is the python code I am working on, in which I tested 3 different approaches: 1>: fit using moments (sample mean and variance). 2>: fit by minimizing the negative log-likelihood (by using scipy.optimize.fmin()). 3>: simply call scipy.stats.beta.fit()

from scipy.optimize import fmin
from scipy.stats import beta
from scipy.special import gamma as gammaf
import matplotlib.pyplot as plt
import numpy


def betaNLL(param,*args):
    '''Negative log likelihood function for beta
    <param>: list for parameters to be fitted.
    <args>: 1-element array containing the sample data.

    Return <nll>: negative log-likelihood to be minimized.
    '''

    a,b=param
    data=args[0]
    pdf=beta.pdf(data,a,b,loc=0,scale=1)
    lg=numpy.log(pdf)
    #-----Replace -inf with 0s------
    lg=numpy.where(lg==-numpy.inf,0,lg)
    nll=-1*numpy.sum(lg)
    return nll

#-------------------Sample data-------------------
data=beta.rvs(5,2,loc=0,scale=1,size=500)

#----------------Normalize to [0,1]----------------
#data=(data-numpy.min(data))/(numpy.max(data)-numpy.min(data))

#----------------Fit using moments----------------
mean=numpy.mean(data)
var=numpy.var(data,ddof=1)
alpha1=mean**2*(1-mean)/var-mean
beta1=alpha1*(1-mean)/mean

#------------------Fit using mle------------------
result=fmin(betaNLL,[1,1],args=(data,))
alpha2,beta2=result

#----------------Fit using beta.fit----------------
alpha3,beta3,xx,yy=beta.fit(data)

print '\n# alpha,beta from moments:',alpha1,beta1
print '# alpha,beta from mle:',alpha2,beta2
print '# alpha,beta from beta.fit:',alpha3,beta3

#-----------------------Plot-----------------------
plt.hist(data,bins=30,normed=True)
fitted=lambda x,a,b:gammaf(a+b)/gammaf(a)/gammaf(b)*x**(a-1)*(1-x)**(b-1) #pdf of beta

xx=numpy.linspace(0,max(data),len(data))
plt.plot(xx,fitted(xx,alpha1,beta1),'g')
plt.plot(xx,fitted(xx,alpha2,beta2),'b')
plt.plot(xx,fitted(xx,alpha3,beta3),'r')

plt.show()

The problem I have is about the normalization process (z=(x-a)/(b-a)) where a and b are the min and max of the sample, respectively.

When I don't do the normalization, everything works Ok, there are slight differences among different fitting methods, by reasonably good.

But when I did the normalization, here is the result plot I got.

Plot

Only the moment method (green line) looks Ok.

The scipy.stats.beta.fit() method (red line) is uniform always, no matter what parameters I use to generate the random numbers.

And the MLE (blue line) fails.

So it seems like the normalization is creating these issues. But I think it is legal to have x=0 and x=1 in the beta distribution. And if given a real world problem, isn't it the 1st step to normalize the sample observations to make it in between [0,1] ? In that case, how should I fit the curve?

973

asked Apr 27 '14 21:04

Jason

3 Answers

The problem is that beta.pdf() sometimes returns 0 and inf for 0 and 1. For example:

>>> from scipy.stats import beta
>>> beta.pdf(1,1.05,0.95)
/usr/lib64/python2.6/site-packages/scipy/stats/distributions.py:1165: RuntimeWarning: divide by zero encountered in power
  Px = (1.0-x)**(b-1.0) * x**(a-1.0)
inf
>>> beta.pdf(0,1.05,0.95)
0.0

You're guaranteeing that you will have one data sample at 0 and 1 by your normalization process. Although you "correct" for values at which the pdf is 0, you are not correcting for those which return inf. To account for this you can just remove all the values which are not finite:

def betaNLL(param,*args):
    """
    Negative log likelihood function for beta
    <param>: list for parameters to be fitted.
    <args>: 1-element array containing the sample data.

    Return <nll>: negative log-likelihood to be minimized.
    """

    a, b = param
    data = args[0]
    pdf = beta.pdf(data,a,b,loc=0,scale=1)
    lg = np.log(pdf)
    mask = np.isfinite(lg)
    nll = -lg[mask].sum()
    return nll

beta fit

Really you shouldn't be normalizing like this though, because you are essentially throwing two data points out of the fit.

158

answered Oct 16 '22 23:10

user545424

Without a docstring for beta.fit, it was a little tricky to find, but if you know the upper and lower limits you want to force upon beta.fit, you can use the kwargs floc and fscale.

I ran your code only using the beta.fit method, but with and without the floc and fscale kwargs. Also, I checked it with the arguments as ints and floats to make sure that wouldn't affect your answer. It didn't (on this test. I can't say if it never would.)

>>> from scipy.stats import beta
>>> import numpy
>>> def betaNLL(param,*args):
    '''Negative log likelihood function for beta
    <param>: list for parameters to be fitted.
    <args>: 1-element array containing the sample data.

    Return <nll>: negative log-likelihood to be minimized.
    '''

    a,b=param
    data=args[0]
    pdf=beta.pdf(data,a,b,loc=0,scale=1)
    lg=numpy.log(pdf)
    #-----Replace -inf with 0s------
    lg=numpy.where(lg==-numpy.inf,0,lg)
    nll=-1*numpy.sum(lg)
    return nll

>>> data=beta.rvs(5,2,loc=0,scale=1,size=500)
>>> beta.fit(data)
(5.696963536654355, 2.0005252702837009, -0.060443307228404922, 1.0580278414086459)
>>> beta.fit(data,floc=0,fscale=1)
(5.0952451826831462, 1.9546341057106007, 0, 1)
>>> beta.fit(data,floc=0.,fscale=1.)
(5.0952451826831462, 1.9546341057106007, 0.0, 1.0)

In conclusion, it seems this doesn't change your data (through normalization) or throw out data. I just think it should be noted that care should be taken when using this. In your case, you knew the limits were 0 and 1 because you got data out of a defined distribution that was between 0 and 1. In other cases, limits might be known, but if they are not known, beta.fit will provide them. In this case, without specifying the limits of 0 and 1, beta.fit calculated them to be loc=-0.06 and scale=1.058.

answered Oct 17 '22 00:10

jdj081

I used the method proposed in doi:10.1080/00949657808810232 to fir the beta parameters:

from scipy.special import psi
from scipy.special import polygamma
from scipy.optimize import root_scalar
from numpy.random import beta
import numpy as np

def ipsi(y):
    if y >= -2.22:
        x = np.exp(y) + 0.5
    else:
        x = - 1/ (y + psi(1))  
    for i in range(5):
        x = x - (psi(x) - y)/(polygamma(1,x))
    return x
        
#%%
# q satisface
# psi(q) - psi(ipsi(lng1 - lng2 + psi(q)) + q) -lng2 = 0 
# O sea, busco raíz de 
# f(q) = psi(q) - psi(ipsi(lng1 - lng2 + psi(q)) + q) -lng2
# luego:
# p = ipsi(lng1 - lng2 + psi(q))
def f(q,lng1,lng2):
    return psi(q) - psi(ipsi(lng1 - lng2 + psi(q)) + q) -lng2

#%%
def ml_beta_pq(sample):
    lng1 = np.log(sample).mean()
    lng2 = np.log(1-sample).mean()
    def g(q):
        return f(q,lng1,lng2)
    q=root_scalar(g,x0=1,x1=1.1).root
    p = ipsi(lng1 - lng2 + psi(q))
    return p, q

#%%
p = 2
q = 5
n = 1500
sample = beta(p,q,n)
ps,qs = ml_beta_pq(sample) #s de sombrero

print(f'Estimación de parámetros de una beta({p}, {q}) \na partir de una muestra de tamaño n = {n}')
print(f'\nn ={n:5d} |   p   |   q')
print(f'---------+-------+------')
print(f'original | {p:2.3f} | {q:2.3f}')
print(f'estimado | {ps:2.3f} | {qs:2.3f}')

answered Oct 17 '22 01:10

rgrimson

Related questions
                            
                                multiprocessing and ctypes with pointers
                            
                                A library like Python's collections.Counter library for C# -> Getting the difference of values between two dictionary objects in C#
                            
                                How to parse the POST argument to a REST service?
                            
                                power spectrum by numpy.fft.fft
                            
                                Rest framework serializer always returns False for is_valid
                            
                                Color logging using logging module in Python
                            
                                Python argparse allow combined flags
                            
                                Python Kerberos-1.1.1.tar.gz Install Failure on Windows
                            
                                looping through nodes and extract attributes in Networkx
                            
                                Equivalent of set -o pipefail in Python?
                            
                                Python API Compatibility Checker
                            
                                urlopen error [Errno 11001] getaddrinfo failed
                            
                                Mismatch between sys.executable and sys.version in Python
                            
                                Celery tasks disappear
                            
                                Write decoded from base64 string to file
                            
                                Optimize code generated by sympy
                            
                                Auto kill process and child process of multiprocessing Pool
                            
                                Analyse frequency of mp3 files with python
                            
                                Python : compare two files with different line endings
                            
                                Real-time interrupts in Python

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to properly fit a beta distribution in python?

Tags:

python

curve-fitting

beta-distribution