I am using Python 3.6 for data fitting. Recently I came across the following problem, and since I lack experience in this area, I'm not sure how to deal with it.
If I use numpy.polyfit(x, y, 1, cov=True) and scipy.optimize.curve_fit(lambda x, a, b: a*x + b, x, y) on the same set of data points, I get nearly the same coefficients a and b. But the values of the covariance matrix from scipy.optimize.curve_fit are roughly half of the values from numpy.polyfit.
Since I want to use the diagonal of the covariance matrix to estimate the uncertainties (u = numpy.sqrt(numpy.diag(cov))) of the coefficients, I have three questions:

1. Which of the two covariance matrices is the correct one (i.e., which one should I trust)?
2. Why do the values differ from each other?
3. Is there anything I can do to resolve the discrepancy?

Thanks!
Edit:
import numpy as np
import scipy.optimize as sc

data = np.array([[1, 2, 3, 4, 5, 6, 7],
                 [1.1, 1.9, 3.2, 4.3, 4.8, 6.0, 7.3]]).T
x = data[:, 0]
y = data[:, 1]

A = np.polyfit(x, y, 1, cov=True)
print('Polyfit:', np.diag(A[1]))

B = sc.curve_fit(lambda x, a, b: a * x + b, x, y)
print('Curve_Fit:', np.diag(B[1]))
If I use statsmodels.api instead, the result corresponds to that of curve_fit.
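That equivalence can also be checked without statsmodels: the textbook OLS covariance, RSS / (n - p) * (XᵀX)⁻¹, reproduces curve_fit's pcov. A sketch (reusing the data from the snippet above; the variable names are just illustrative):

```python
import numpy as np
from scipy import optimize

x = np.array([1., 2., 3., 4., 5., 6., 7.])
y = np.array([1.1, 1.9, 3.2, 4.3, 4.8, 6.0, 7.3])

# design matrix for the straight-line model a*x + b
X = np.column_stack([x, np.ones_like(x)])

# least-squares coefficients and residual sum of squares
coef, rss, rank, sv = np.linalg.lstsq(X, y, rcond=None)
n, p = X.shape

# textbook OLS covariance: RSS / (n - p) * (X^T X)^{-1}
cov_ols = rss[0] / (n - p) * np.linalg.inv(X.T @ X)

# curve_fit reports the same scaling
popt, pcov = optimize.curve_fit(lambda x, a, b: a * x + b, x, y)
print(np.allclose(cov_ols, pcov, rtol=1e-5))  # True
```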
I imagine it has something to do with this snippet from numpy's polyfit source (numpy/lib/polynomial.py):

    # Some literature ignores the extra -2.0 factor in the denominator, but
    # it is included here because the covariance of Multivariate Student-T
    # (which is implied by a Bayesian uncertainty analysis) includes it.
    # Plus, it gives a slightly more conservative estimate of uncertainty.
    if len(x) <= order + 2:
        raise ValueError("the number of data points must exceed order + 2 "
                         "for Bayesian estimate the covariance matrix")
    fac = resids / (len(x) - order - 2.0)
    if y.ndim == 1:
        return c, Vbase * fac
    else:
        return c, Vbase[:, :, NX.newaxis] * fac
Note that in polyfit's source, order is deg + 1, i.e. the number of fitted coefficients, so for your degree-1 fit with 7 data points, len(x) - order is 5 and (len(x) - order - 2.0) is 3. curve_fit divides the residual sum of squares by the first of these (data points minus parameters), while this version of polyfit divides by the second, so polyfit's covariance entries come out larger by a factor of 5/3, which matches the discrepancy you observe. (Newer NumPy versions have since changed polyfit's default scaling to the conventional len(x) - order.) This explains question 2. The answer to question 3 is likely "get more data", as for larger len(x) the ratio of the two denominators approaches 1 and the difference becomes negligible.
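To illustrate, here is a small sketch that applies both denominators to the same fit; the names ratio and pcov_polyfit_style are just illustrative:

```python
import numpy as np
from scipy import optimize

x = np.array([1., 2., 3., 4., 5., 6., 7.])
y = np.array([1.1, 1.9, 3.2, 4.3, 4.8, 6.0, 7.3])

deg = 1
order = deg + 1  # polyfit's internal 'order' = number of coefficients
n = len(x)

popt, pcov = optimize.curve_fit(lambda x, a, b: a * x + b, x, y)

# curve_fit scales the unscaled covariance by rss / (n - order); the old
# polyfit code used rss / (n - order - 2.0) instead, so the two matrices
# differ by exactly the ratio of the two denominators
ratio = (n - order) / (n - order - 2.0)
pcov_polyfit_style = pcov * ratio
print(ratio)  # 1.6666666666666667, i.e. 5/3
```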
Which formulation is correct (question 1) is probably a question for Cross Validated, but I'd assume it is curve_fit's, as that function is explicitly intended to calculate the uncertainties as you state. From the documentation:
pcov : 2d array
The estimated covariance of popt. The diagonals provide the variance of the parameter estimate. To compute one standard deviation errors on the parameters use perr = np.sqrt(np.diag(pcov)).
The comment in the polyfit code quoted above, on the other hand, says that its intention is geared more toward a Student-t analysis.
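Following the documentation quoted above, extracting the one-standard-deviation errors from pcov might look like this (reusing the question's data):

```python
import numpy as np
from scipy import optimize

x = np.array([1., 2., 3., 4., 5., 6., 7.])
y = np.array([1.1, 1.9, 3.2, 4.3, 4.8, 6.0, 7.3])

popt, pcov = optimize.curve_fit(lambda x, a, b: a * x + b, x, y)
perr = np.sqrt(np.diag(pcov))  # one-standard-deviation uncertainties

for name, value, err in zip(('a', 'b'), popt, perr):
    print(f'{name} = {value:.3f} +/- {err:.3f}')
```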