I have a nonlinear model fit that looks like this:
The dark solid line is the model fit, and the grey part is the raw data.
Short version of the question: how do I get the likelihood of this model fit, so I can perform a log-likelihood ratio test? Assume that the residuals are normally distributed.
I am relatively new to statistics, and my current thoughts are:
Get the residuals from the curve fit, and calculate the variance of the residuals;
Plug the variance of the residuals into $\sigma^2$ in the Gaussian log-likelihood
$$\ln L = \sum_i \ln\!\left(\frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\!\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)\right),$$
with $x_i$ as the experimental values and $\mu$ as the model fit (see the sketch after this list);
Calculate the log-likelihood ratio.
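For concreteness, here is a minimal sketch of what steps 1 and 2 might look like in Python. The arrays y_observed and y_fitted are placeholders for the experimental data and the model predictions (they are not from the original question), and whether this is the right likelihood to use is exactly what the answer below discusses.

import numpy as np
from scipy import stats

# placeholder data: experimental observations and the corresponding model fit
y_observed = np.array([1.02, 1.98, 3.05, 3.97, 5.10])
y_fitted   = np.array([1.00, 2.00, 3.00, 4.00, 5.00])

residuals = y_observed - y_fitted
sigma2 = residuals.var()   # variance of the residuals

# Gaussian log-likelihood with the fitted values as mu and the residual variance as sigma^2
log_lik = stats.norm.logpdf(y_observed, loc=y_fitted, scale=np.sqrt(sigma2)).sum()
print(log_lik)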
Could anyone help me with these two full-version questions?
Is my method correct? (I think so, but it would be really great to make sure!)
Are there any ready-made functions in python/scipy/statsmodels to do this for me?
Your likelihood function,
$$\ln L(\mu,\sigma) = \sum_i \ln\!\left(\frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\!\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)\right),$$
which is simply the sum of the logs of the Gaussian probability density function, is the likelihood of fitting a $\mu$ and a $\sigma$ to your residuals, not the likelihood of your model given your data. In short, that approach is wrong.
Since you are doing nonlinear least squares, then, following what @usethedeathstar already mentioned, you should go straight for the F-test. Consider the following example, modified from http://www.walkingrandomly.com/?p=5254. We conduct the F-test in R, and discuss how to translate it into Python at the end.
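For reference, the F statistic that compares a reduced model against a full model with extra parameters is built from their residual sums of squares (RSS) and residual degrees of freedom:

$$F = \frac{\left(\mathrm{RSS}_{\mathrm{reduced}} - \mathrm{RSS}_{\mathrm{full}}\right)/\left(df_{\mathrm{reduced}} - df_{\mathrm{full}}\right)}{\mathrm{RSS}_{\mathrm{full}}/df_{\mathrm{full}}}$$

and it is referred to an F distribution with $(df_{\mathrm{reduced}} - df_{\mathrm{full}},\ df_{\mathrm{full}})$ degrees of freedom. This is what anova() computes from the two nested nls fits in the R session below.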
# construct the data vectors using c()
> xdata = c(-2,-1.64,-1.33,-0.7,0,0.45,1.2,1.64,2.32,2.9)
> ydata = c(0.699369,0.700462,0.695354,1.03905,1.97389,2.41143,1.91091,0.919576,-0.730975,-1.42001)
# some starting values
> p1 = 1
> p2 = 0.2
> p3 = 0.01
# do the fit
> fit1 = nls(ydata ~ p1*cos(p2*xdata) + p2*sin(p1*xdata), start=list(p1=p1,p2=p2))
> fit2 = nls(ydata ~ p1*cos(p2*xdata) + p2*sin(p1*xdata)+p3*xdata, start=list(p1=p1,p2=p2,p3=p3))
# summarise
> summary(fit1)
Formula: ydata ~ p1 * cos(p2 * xdata) + p2 * sin(p1 * xdata)
Parameters:
Estimate Std. Error t value Pr(>|t|)
p1 1.881851 0.027430 68.61 2.27e-12 ***
p2 0.700230 0.009153 76.51 9.50e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.08202 on 8 degrees of freedom
Number of iterations to convergence: 7
Achieved convergence tolerance: 2.189e-06
> summary(fit2)
Formula: ydata ~ p1 * cos(p2 * xdata) + p2 * sin(p1 * xdata) + p3 * xdata
Parameters:
Estimate Std. Error t value Pr(>|t|)
p1 1.90108 0.03520 54.002 1.96e-10 ***
p2 0.70657 0.01167 60.528 8.82e-11 ***
p3 0.02029 0.02166 0.937 0.38
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.08243 on 7 degrees of freedom
Number of iterations to convergence: 9
Achieved convergence tolerance: 2.476e-06
> anova(fit2, fit1)
Analysis of Variance Table
Model 1: ydata ~ p1 * cos(p2 * xdata) + p2 * sin(p1 * xdata) + p3 * xdata
Model 2: ydata ~ p1 * cos(p2 * xdata) + p2 * sin(p1 * xdata)
Res.Df Res.Sum Sq Df Sum Sq F value Pr(>F)
1 7 0.047565
2 8 0.053813 -1 -0.0062473 0.9194 0.3696
Here we have two models: fit1 has 2 parameters, so its residuals have 8 degrees of freedom; fit2 has one additional parameter, so its residuals have 7 degrees of freedom. Is the bigger model significantly better? No: the F value is 0.9194 on (1, 7) degrees of freedom, and it is not significant.
To reconstruct the ANOVA table: the residual degrees of freedom are easy. The residual sums of squares are 0.08202*0.08202*8 = 0.05381 and 0.08243*0.08243*7 = 0.04756293 (notice 'Residual standard error: 0.08202 on 8 degrees of freedom', 'Residual standard error: 0.08243 on 7 degrees of freedom'). In Python you can get them as ((y_observed - y_fitted)**2).sum(), since scipy.optimize.curve_fit() doesn't return the residuals. The F-ratio is 0.0062473 / (0.047565 / 7) = 0.9194, and the p-value is 1 - scipy.stats.f.cdf(0.9194, 1, 7).
Putting it all together, here is the Python equivalent:
In [1]:
import numpy as np
import scipy.optimize as so
import scipy.stats as ss

xdata = np.array([-2,-1.64,-1.33,-0.7,0,0.45,1.2,1.64,2.32,2.9])
ydata = np.array([0.699369,0.700462,0.695354,1.03905,1.97389,2.41143,1.91091,0.919576,-0.730975,-1.42001])

# reduced model (2 parameters) and full model (3 parameters), as in the R example
def model0(x, p1, p2):
    return p1*np.cos(p2*x) + p2*np.sin(p1*x)

def model1(x, p1, p2, p3):
    return p1*np.cos(p2*x) + p2*np.sin(p1*x) + p3*x

# starting values
p1, p2, p3 = 1, 0.2, 0.01
fit0 = so.curve_fit(model0, xdata, ydata, p0=(p1, p2))[0]
fit1 = so.curve_fit(model1, xdata, ydata, p0=(p1, p2, p3))[0]

# residual sums of squares for each fit
yfit0 = model0(xdata, fit0[0], fit0[1])
yfit1 = model1(xdata, fit1[0], fit1[1], fit1[2])
ssq0 = ((yfit0 - ydata)**2).sum()
ssq1 = ((yfit1 - ydata)**2).sum()

# F-test: 1 extra parameter, n - 3 residual degrees of freedom for the full model
df = len(xdata) - 3
f_ratio = (ssq0 - ssq1) / (ssq1 / df)
p = 1 - ss.f.cdf(f_ratio, 1, df)

In [2]:
print(f_ratio, p)
0.919387419515 0.369574503394
As @usethedeathstar pointed out: when the residuals are normally distributed, nonlinear least squares IS maximum likelihood. Therefore the F-test and the likelihood ratio test are equivalent here, because the F-ratio is a monotone transformation of the likelihood ratio λ.
Or in a descriptive way, see: http://www.stata.com/support/faqs/statistics/chi-squared-and-f-distributions/
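To make that equivalence concrete, here is a small sketch (the helper max_loglik and the hard-coded n are mine; the residual sums of squares are the ones from the fits above): with Gaussian errors the maximized log-likelihood of a least-squares fit depends only on the residual sum of squares, so the likelihood ratio statistic is a monotone function of the F-ratio.

import numpy as np

n = 10                              # number of data points in the example above
ssq0, ssq1 = 0.053813, 0.047565     # residual sums of squares: reduced fit, full fit

# maximized Gaussian log-likelihood of a least-squares fit: -n/2 * (log(2*pi*RSS/n) + 1)
def max_loglik(rss, n):
    return -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)

lr_stat = 2 * (max_loglik(ssq1, n) - max_loglik(ssq0, n))   # -2*log(lambda)
f_ratio = (ssq0 - ssq1) / (ssq1 / 7)                        # F-ratio, as before

# identity: -2*log(lambda) = n * log(1 + F/df_full), so one is a monotone function of the other
print(lr_stat, n * np.log(1 + f_ratio / 7))                 # both ~1.234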