I am trying to get a correct way of fitting a beta distribution. It's not a real world problem i am just testing the effects of a few different methods, and in doing this something is puzzling me.
Here is the python code I am working on, in which I tested 3 different approaches: 1>: fit using moments (sample mean and variance). 2>: fit by minimizing the negative log-likelihood (by using scipy.optimize.fmin()). 3>: simply call scipy.stats.beta.fit()
from scipy.optimize import fmin
from scipy.stats import beta
from scipy.special import gamma as gammaf
import matplotlib.pyplot as plt
import numpy
def betaNLL(param,*args):
'''Negative log likelihood function for beta
<param>: list for parameters to be fitted.
<args>: 1-element array containing the sample data.
Return <nll>: negative log-likelihood to be minimized.
'''
a,b=param
data=args[0]
pdf=beta.pdf(data,a,b,loc=0,scale=1)
lg=numpy.log(pdf)
#-----Replace -inf with 0s------
lg=numpy.where(lg==-numpy.inf,0,lg)
nll=-1*numpy.sum(lg)
return nll
#-------------------Sample data-------------------
data=beta.rvs(5,2,loc=0,scale=1,size=500)
#----------------Normalize to [0,1]----------------
#data=(data-numpy.min(data))/(numpy.max(data)-numpy.min(data))
#----------------Fit using moments----------------
mean=numpy.mean(data)
var=numpy.var(data,ddof=1)
alpha1=mean**2*(1-mean)/var-mean
beta1=alpha1*(1-mean)/mean
#------------------Fit using mle------------------
result=fmin(betaNLL,[1,1],args=(data,))
alpha2,beta2=result
#----------------Fit using beta.fit----------------
alpha3,beta3,xx,yy=beta.fit(data)
print '\n# alpha,beta from moments:',alpha1,beta1
print '# alpha,beta from mle:',alpha2,beta2
print '# alpha,beta from beta.fit:',alpha3,beta3
#-----------------------Plot-----------------------
plt.hist(data,bins=30,normed=True)
fitted=lambda x,a,b:gammaf(a+b)/gammaf(a)/gammaf(b)*x**(a-1)*(1-x)**(b-1) #pdf of beta
xx=numpy.linspace(0,max(data),len(data))
plt.plot(xx,fitted(xx,alpha1,beta1),'g')
plt.plot(xx,fitted(xx,alpha2,beta2),'b')
plt.plot(xx,fitted(xx,alpha3,beta3),'r')
plt.show()
The problem I have is about the normalization process (z=(x-a)/(b-a)
) where a
and b
are the min and max of the sample, respectively.
When I don't do the normalization, everything works Ok, there are slight differences among different fitting methods, by reasonably good.
But when I did the normalization, here is the result plot I got.
Only the moment method (green line) looks Ok.
The scipy.stats.beta.fit() method (red line) is uniform always, no matter what parameters I use to generate the random numbers.
And the MLE (blue line) fails.
So it seems like the normalization is creating these issues. But I think it is legal to have x=0
and x=1
in the beta distribution. And if given a real world problem, isn't it the 1st step to normalize the sample observations to make it in between [0,1] ? In that case, how should I fit the curve?
The beta distribution is a bounded continuous distribution. If is often used to express an uncertainty in a proportion, frequency, or percentage, which are all quantities between 0 and 1.
The Formula for the Beta Distribution 'A' and 'b' are used for representing lower and the upper bounds respectively for the distribution. B(p, q) is the beta function. The beta function has this formula: B(α,β)=∫01t(α−1)(1−t)(β−1)dt.
The beta distribution is a family of continuous probability distributions set on the interval [0, 1] having two positive shape parameters, expressed by α and β. These two parameters appear as exponents of the random variable and manage the shape of the distribution.
A Beta distribution is a versatile way to represent outcomes for percentages or proportions. For example, how likely is it that a rogue candidate will win the next Presidential election? You might think the probability is 0.2. Your friend might think it's 0.15.
The problem is that beta.pdf()
sometimes returns 0
and inf
for 0
and 1
. For example:
>>> from scipy.stats import beta
>>> beta.pdf(1,1.05,0.95)
/usr/lib64/python2.6/site-packages/scipy/stats/distributions.py:1165: RuntimeWarning: divide by zero encountered in power
Px = (1.0-x)**(b-1.0) * x**(a-1.0)
inf
>>> beta.pdf(0,1.05,0.95)
0.0
You're guaranteeing that you will have one data sample at 0
and 1
by your normalization process. Although you "correct" for values at which the pdf is 0
, you are not correcting for those which return inf
. To account for this you can just remove all the values which are not finite:
def betaNLL(param,*args):
"""
Negative log likelihood function for beta
<param>: list for parameters to be fitted.
<args>: 1-element array containing the sample data.
Return <nll>: negative log-likelihood to be minimized.
"""
a, b = param
data = args[0]
pdf = beta.pdf(data,a,b,loc=0,scale=1)
lg = np.log(pdf)
mask = np.isfinite(lg)
nll = -lg[mask].sum()
return nll
Really you shouldn't be normalizing like this though, because you are essentially throwing two data points out of the fit.
Without a docstring for beta.fit
, it was a little tricky to find, but if you know the upper and lower limits you want to force upon beta.fit
, you can use the kwargs floc
and fscale
.
I ran your code only using the beta.fit
method, but with and without the floc and fscale kwargs. Also, I checked it with the arguments as ints and floats to make sure that wouldn't affect your answer. It didn't (on this test. I can't say if it never would.)
>>> from scipy.stats import beta
>>> import numpy
>>> def betaNLL(param,*args):
'''Negative log likelihood function for beta
<param>: list for parameters to be fitted.
<args>: 1-element array containing the sample data.
Return <nll>: negative log-likelihood to be minimized.
'''
a,b=param
data=args[0]
pdf=beta.pdf(data,a,b,loc=0,scale=1)
lg=numpy.log(pdf)
#-----Replace -inf with 0s------
lg=numpy.where(lg==-numpy.inf,0,lg)
nll=-1*numpy.sum(lg)
return nll
>>> data=beta.rvs(5,2,loc=0,scale=1,size=500)
>>> beta.fit(data)
(5.696963536654355, 2.0005252702837009, -0.060443307228404922, 1.0580278414086459)
>>> beta.fit(data,floc=0,fscale=1)
(5.0952451826831462, 1.9546341057106007, 0, 1)
>>> beta.fit(data,floc=0.,fscale=1.)
(5.0952451826831462, 1.9546341057106007, 0.0, 1.0)
In conclusion, it seems this doesn't change your data (through normalization) or throw out data. I just think it should be noted that care should be taken when using this. In your case, you knew the limits were 0 and 1 because you got data out of a defined distribution that was between 0 and 1. In other cases, limits might be known, but if they are not known, beta.fit
will provide them. In this case, without specifying the limits of 0 and 1, beta.fit
calculated them to be loc=-0.06
and scale=1.058
.
I used the method proposed in doi:10.1080/00949657808810232 to fir the beta parameters:
from scipy.special import psi
from scipy.special import polygamma
from scipy.optimize import root_scalar
from numpy.random import beta
import numpy as np
def ipsi(y):
if y >= -2.22:
x = np.exp(y) + 0.5
else:
x = - 1/ (y + psi(1))
for i in range(5):
x = x - (psi(x) - y)/(polygamma(1,x))
return x
#%%
# q satisface
# psi(q) - psi(ipsi(lng1 - lng2 + psi(q)) + q) -lng2 = 0
# O sea, busco raíz de
# f(q) = psi(q) - psi(ipsi(lng1 - lng2 + psi(q)) + q) -lng2
# luego:
# p = ipsi(lng1 - lng2 + psi(q))
def f(q,lng1,lng2):
return psi(q) - psi(ipsi(lng1 - lng2 + psi(q)) + q) -lng2
#%%
def ml_beta_pq(sample):
lng1 = np.log(sample).mean()
lng2 = np.log(1-sample).mean()
def g(q):
return f(q,lng1,lng2)
q=root_scalar(g,x0=1,x1=1.1).root
p = ipsi(lng1 - lng2 + psi(q))
return p, q
#%%
p = 2
q = 5
n = 1500
sample = beta(p,q,n)
ps,qs = ml_beta_pq(sample) #s de sombrero
print(f'Estimación de parámetros de una beta({p}, {q}) \na partir de una muestra de tamaño n = {n}')
print(f'\nn ={n:5d} | p | q')
print(f'---------+-------+------')
print(f'original | {p:2.3f} | {q:2.3f}')
print(f'estimado | {ps:2.3f} | {qs:2.3f}')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With