
Linear fit including all errors with NumPy/SciPy

I have a lot of x-y data points with errors on y that I need to fit non-linear functions to. Those functions can be linear in some cases, but are more usually exponential decay, gauss curves and so on. SciPy supports this kind of fitting with scipy.optimize.curve_fit, and I can also specify the weight of each point. This gives me weighted non-linear fitting which is great. From the results, I can extract the parameters and their respective errors.

There is just one caveat: the errors are only used as relative weights, they do not propagate into the reported parameter errors. If I double the errors on all of my data points, I would expect the uncertainty of the result to increase as well. So I built a test case (source code) to check this.
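A minimal sketch of what such a test case can look like (the generate_data function and the numbers in it are illustrative placeholders, not my actual script):

import numpy as np
from scipy.optimize import curve_fit

def linear(x, a, b):
    return a * x + b

def generate_data(n=200, seed=0):
    # Illustrative stand-in for the test data: a line with known Gaussian errors on y.
    rng = np.random.RandomState(seed)
    x = np.linspace(0, 10, n)
    y_err = np.full(n, 0.5)
    y = linear(x, 2.0, 3.0) + rng.normal(0.0, y_err)
    return x, y, y_err

x, y, y_err = generate_data()
for scale in (1, 2):
    popt, pcov = curve_fit(linear, x, y, sigma=scale * y_err)
    print("Fit with sigma scaled by", scale)
    print("Parameters:", popt)
    print("Errors:    ", np.sqrt(np.diag(pcov)))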

Fit with scipy.optimize.curve_fit gives me:

Parameters: [ 1.99900756  2.99695535]
Errors:     [ 0.00424833  0.00943236]

Same but with 2 * y_err:

Parameters: [ 1.99900756  2.99695535]
Errors:     [ 0.00424833  0.00943236]

You can see that the values are identical, which tells me that the algorithm does not take the magnitude of the errors into account. But I think it should: the values ought to be different.

I read about another fit method here, so I tried scipy.odr as well:

Beta: [ 2.00538124  2.95000413]
Beta Std Error: [ 0.00652719  0.03870884]

Same but with 20 * y_err:

Beta: [ 2.00517894  2.9489472 ]
Beta Std Error: [ 0.00642428  0.03647149]

The values are slightly different, but I do not think this accounts for the increase in the errors at all. It looks more like rounding effects or slightly different weighting.
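For reference, a scipy.odr fit along these lines can be set up roughly like this (again with illustrative placeholder data, not my actual script):

import numpy as np
from scipy import odr

rng = np.random.RandomState(0)
x = np.linspace(0, 10, 200)
y_err = np.full(x.size, 0.5)
y = 2.0 * x + 3.0 + rng.normal(0.0, y_err)

def linear_odr(beta, x):
    # scipy.odr expects f(beta, x) rather than f(x, a, b)
    return beta[0] * x + beta[1]

model = odr.Model(linear_odr)
data = odr.RealData(x, y, sy=y_err)   # sy takes the 1-sigma errors on y
out = odr.ODR(data, model, beta0=[1.0, 1.0]).run()
print("Beta:", out.beta)
print("Beta Std Error:", out.sd_beta)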

Is there some package that allows me to fit the data and get the actual errors? I have the formulas here in a book, but I do not want to implement this myself if I do not have to.


I have now read about linfit.py in another question. This handles what I have in mind quite well. It supports both modes (treating the errors as absolute uncertainties or as relative weights), and the first one is what I need.

Fit with linfit:
Parameters: [ 2.02600849  2.91759066]
Errors:     [ 0.00772283  0.04449971]

Same but with 20 * y_err:
Parameters: [ 2.02600849  2.91759066]
Errors:     [ 0.15445662  0.88999413]

Fit with linfit(relsigma=True):
Parameters: [ 2.02600849  2.91759066]
Errors:     [ 0.00622595  0.03587451]

Same but with 20 * y_err:
Parameters: [ 2.02600849  2.91759066]
Errors:     [ 0.00622595  0.03587451]
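For completeness, a sketch of how the two linfit calls above might look. I am writing the keyword names (sigmay, relsigma, cov) and the return values from memory of the linfit.py documentation, so treat them as assumptions and check against the module itself:

import numpy as np
from linfit import linfit   # linfit.py from the linked question

rng = np.random.RandomState(0)
x = np.linspace(0, 10, 200)
y_err = np.full(x.size, 0.5)
y = 2.0 * x + 3.0 + rng.normal(0.0, y_err)

# Assumed interface: with relsigma=False the parameter errors scale with y_err,
# with relsigma=True the errors act as relative weights only.
fit, cvm = linfit(x, y, sigmay=y_err, relsigma=False, cov=True)
print("Parameters:", fit)
print("Errors:    ", np.sqrt(np.diag(cvm)))

fit, cvm = linfit(x, y, sigmay=y_err, relsigma=True, cov=True)
print("Parameters:", fit)
print("Errors:    ", np.sqrt(np.diag(cvm)))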

Should I answer my question or just close/delete it now?

asked May 30 '14 by Martin Ueding

2 Answers

Please note that, from the documentation of curve_fit:

sigma : None or N-length sequence
    If not None, this vector will be used as relative weights in the least-squares problem.

The key point here is as relative weights; therefore, yerr (line 53 of your script) and 2*yerr (line 57) should give you similar, if not the same, results.

When you increase the actual residual error, you will see the values in the covariance matrix grow. For example, if we change y += random to y += 5*random in the function generate_data():

Fit with scipy.optimize.curve_fit:
('Parameters:', array([ 1.92810458,  3.97843448]))
('Errors:    ', array([ 0.09617346,  0.64127574]))

Compared to the original result:

Fit with scipy.optimize.curve_fit:
('Parameters:', array([ 2.00760386,  2.97817514]))
('Errors:    ', array([ 0.00782591,  0.02983339]))

Also notice that the parameter estimates are now further off from (2, 3), as we would expect from the increased residual error and the larger confidence intervals of the parameter estimates.
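A small demonstration of this point, using hypothetical data rather than the original test script: scaling the actual scatter of the data inflates the reported errors, while scaling sigma alone does not:

import numpy as np
from scipy.optimize import curve_fit

def linear(x, a, b):
    return a * x + b

rng = np.random.RandomState(0)
x = np.linspace(0, 10, 200)
y_err = np.full(x.size, 0.5)

for noise_scale in (1, 5):   # like changing y += random to y += 5*random
    y = linear(x, 2.0, 3.0) + noise_scale * rng.normal(0.0, y_err)
    popt, pcov = curve_fit(linear, x, y, sigma=y_err)
    print("noise scale %d -> Errors:" % noise_scale, np.sqrt(np.diag(pcov)))

# Doubling sigma itself leaves the result unchanged, since it only enters
# the fit as a relative weight:
popt, pcov = curve_fit(linear, x, y, sigma=2 * y_err)
print("2 * sigma      -> Errors:", np.sqrt(np.diag(pcov)))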

answered by CT Zhu


One way that works well and actually gives a better result is the bootstrap method. When data points with errors are given, one uses a parametric bootstrap and lets each x and y value describe a Gaussian distribution. One then draws a point from each of those distributions and obtains a new bootstrapped sample. Performing a simple unweighted fit on that sample gives one value for each parameter.

This process is repeated some 300 to a couple of thousand times. One ends up with a distribution of the fit parameters, from which one can take the mean and standard deviation to obtain a value and an error for each parameter.

Another neat thing is that one does not obtain a single fit curve as a result, but lots of them. For each interpolated x value one can again take mean and standard deviation of the many values f(x, param) and obtain an error band:

[Figure: data points with the bootstrapped fit curves forming an error band around the central fit]

Further steps in the analysis are then performed hundreds of times, once with each set of bootstrapped fit parameters. This also takes the correlation of the fit parameters into account, as one can see clearly in the plot above: although a symmetric function was fitted to the data, the error band is asymmetric, meaning that interpolated values on the left have a larger uncertainty than those on the right.
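A minimal sketch of the parametric bootstrap described above, assuming errors on y only and an illustrative linear model; all names here are placeholders rather than my actual analysis code:

import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    return a * x + b

rng = np.random.RandomState(1)
x = np.linspace(0, 10, 50)
y_err = np.full(x.size, 0.5)
y = model(x, 2.0, 3.0) + rng.normal(0.0, y_err)

n_boot = 1000
params = []
for _ in range(n_boot):
    # Resample each y from a Gaussian centred on the measured value,
    # with the quoted error as its standard deviation.
    y_sample = rng.normal(y, y_err)
    popt, _ = curve_fit(model, x, y_sample)   # simple unweighted fit
    params.append(popt)
params = np.array(params)

print("Values:", params.mean(axis=0))
print("Errors:", params.std(axis=0))

# Error band: evaluate every bootstrapped curve on a fine grid and take
# mean and standard deviation per x value.
x_fine = np.linspace(x.min(), x.max(), 200)
curves = np.array([model(x_fine, *p) for p in params])
band_mean, band_std = curves.mean(axis=0), curves.std(axis=0)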

answered by Martin Ueding