 

What exactly is the variance on the parameters of SciPy curve fit? (Python)

I'm currently using the curve_fit function of the scipy.optimize package in Python, and I know that if you take the square root of the diagonal entries of the covariance matrix you get from curve_fit, you get the standard deviation of the parameters that curve_fit calculated. What I'm not sure about is what exactly this standard deviation means. It's an approximation using a Hessian matrix as far as I understand, but what would the exact calculation be? The standard deviation of a Gaussian bell curve tells you what percentage of the area lies within a certain range of the curve, so I assumed that for curve_fit it tells you how many data points lie between certain parameter values, but apparently that isn't right...

I'm sorry if this should be basic knowledge for curve fitting, but I really can't figure out what the standard deviations mean. They express an error on the parameters, but those parameters are calculated as the best possible fit for the function. It's not as though there is a whole collection of optimal parameters whose average we take, which would naturally come with a standard deviation. There's only one optimal value, so what is there to compare it with? I guess my question really comes down to this: how can I manually and accurately calculate these standard deviations, rather than just getting an approximation using a Hessian matrix?

Asked by MMichelis on Jun 23 '17


People also ask

What does SciPy curve_fit return?

We can then call the curve_fit() function to fit a straight line to the dataset using our defined function. curve_fit() returns the optimal values for the mapping function, e.g., the coefficient values. It also returns a covariance matrix for the estimated parameters, but we can ignore that for now.

How does SciPy curve_fit work?

scipy.optimize.curve_fit(func, x, y) will return a tuple of two arrays: the first contains the values of a and b that best fit your data, and the second is the covariance of those optimal fit parameters. A minimal linear-fit example is sketched below.
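Here is a minimal sketch of such a linear fit (the model func and the data arrays below are made up for illustration, not taken from the question):

import numpy as np
from scipy.optimize import curve_fit

# Hypothetical linear model with parameters a and b.
def func(x, a, b):
    return a * x + b

# Made-up example data.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1])

popt, pcov = curve_fit(func, x, y)
print("best-fit a, b:", popt)        # optimal parameter values
print("covariance matrix:", pcov)    # uncertainty and correlation information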

What is covariance of a curve fit?

The second array returned by curve_fit is the estimated covariance of the parameter estimates. Loosely speaking, given the data and a model, it expresses how much information there is in the data to determine the value of each parameter in the given model.

What is SciPy curve_fit?

The SciPy API provides a 'curve_fit' function in its optimization library to fit the data with a given function. This method applies non-linear least squares to fit the data and extract the optimal parameters out of it.


1 Answer

The variance in the fitted parameters represents the uncertainty in the best-fit value based on the quality of the fit of the model to the data. That is, it describes by how much a value could change away from its best-fit value while still giving a fit almost as good as the best one.

With the standard definition of chi-square,
chi_square = ( ( (data - model)/epsilon )**2 ).sum()

and reduced_chi_square = chi_square / (ndata - nvarys) (where data is the array of data values, model the array of calculated model values, epsilon the uncertainty in the data, ndata the number of data points, and nvarys the number of variables), a good fit should have reduced_chi_square around 1, or chi_square around ndata - nvarys. (Note: not 0 -- the fit will not be perfect, as there is noise in the data.)
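A small sketch of those definitions, continuing with the made-up linear model and data from above (epsilon is an assumed per-point uncertainty; absolute_sigma=True makes curve_fit treat it as an absolute error rather than a relative weight):

import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b):
    return a * x + b

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
data = np.array([0.1, 0.9, 2.2, 2.8, 4.1])
epsilon = 0.1 * np.ones_like(data)    # assumed uncertainty in the data

popt, pcov = curve_fit(func, x, data, sigma=epsilon, absolute_sigma=True)
model = func(x, *popt)                # array of calculated model values

ndata, nvarys = len(data), len(popt)
chi_square = (((data - model) / epsilon) ** 2).sum()
reduced_chi_square = chi_square / (ndata - nvarys)
print(chi_square, reduced_chi_square) # a good fit: reduced_chi_square ~ 1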

The standard error in the best-fit value for a variable (the square root of its variance) gives the amount by which you can change that value (re-optimizing all other values) and increase chi-square by 1. That gives the so-called '1-sigma' value of the uncertainty.
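A rough numerical check of that statement, under the same made-up model and data as above: fix one parameter at its best-fit value plus its 1-sigma uncertainty, re-optimize the other, and compare the chi-square values; the increase should be close to 1 (exactly 1 for a model that is linear in its parameters):

import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b):
    return a * x + b

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
data = np.array([0.1, 0.9, 2.2, 2.8, 4.1])
epsilon = 0.1 * np.ones_like(data)

def chi_square(a, b):
    return (((data - func(x, a, b)) / epsilon) ** 2).sum()

popt, pcov = curve_fit(func, x, data, sigma=epsilon, absolute_sigma=True)
sigma_a = np.sqrt(pcov[0, 0])         # 1-sigma uncertainty on a

# Fix a at best-fit + 1 sigma and re-optimize b alone.
a_shifted = popt[0] + sigma_a
(b_refit,), _ = curve_fit(lambda x, b: func(x, a_shifted, b), x, data,
                          sigma=epsilon, absolute_sigma=True)

# chi-square should rise by about 1 relative to the best fit.
print(chi_square(*popt), chi_square(a_shifted, b_refit))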

As you say, these values are expressed in the diagonal terms of the covariance matrix returned by scipy.optimize.curve_fit (the off-diagonal terms give the correlations between variables: if the value of one variable is changed away from its optimum, how would the others respond to keep the fit as good as possible). This covariance matrix is built up from the trial values and derivatives near the solution as the fit is being done -- it measures the "curvature" of the parameter space (i.e., how much chi-square changes when a variable's value changes).
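In code, under the same assumptions as the sketches above, the 1-sigma uncertainties and the correlation matrix can be pulled out of pcov like this:

import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b):
    return a * x + b

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1])
popt, pcov = curve_fit(func, x, y)

perr = np.sqrt(np.diag(pcov))       # 1-sigma uncertainty on each parameter
corr = pcov / np.outer(perr, perr)  # off-diagonal terms: parameter correlations
print(perr)
print(corr)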

You can calculate these uncertainties by hand, but the lmfit library (https://lmfit.github.io/lmfit-py/) has routines to explore the confidence intervals of variables from least-squares minimization or curve fitting more explicitly; these are described in more detail at https://lmfit.github.io/lmfit-py/confidence.html. It's probably easiest to use lmfit for the curve fitting rather than trying to re-implement the confidence-interval code for curve_fit.
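A minimal sketch of the lmfit route, reusing the made-up model and data from the earlier sketches (requires pip install lmfit; conf_interval() walks each parameter away from its best-fit value to map out confidence intervals explicitly):

import numpy as np
from lmfit import Model

def func(x, a, b):
    return a * x + b

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1])

model = Model(func)
result = model.fit(y, x=x, a=1.0, b=0.0)   # initial guesses for a and b
print(result.fit_report())                 # best-fit values and standard errors

ci = result.conf_interval()                # explicitly explored intervals
print(result.ci_report())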

Answered by M Newville on Nov 14 '22