I have a set of data points, which appear sort of like a line with a curve near the beginning. See the image below, which shows the points with a line of best fit (fit to the whole data set).
Instead, they could be described by two linear functions (a line through the leftmost set of points and a separate line through the rest of the data points). What these points actually correspond to are neutron decay, which contains two different isotopes.
I don't know which points correspond to which isotope, so I need to somehow make a best guess. The curve for one isotope will be a straight line, and the curve for the other will be a different straight line. How can I fit two different lines of best fit (linear) to the set of data points, such that the fitting for both is optimized?
One idea I had is to pick a "cutouff point", say at t=100
(x-axis), and fit the points on the left to a line, and the points to the right to a different line. I could then compute chi^2 for both lines in order to get the "goodness" of the fits. Then, I could just keep doing the same thing many times with slightly different cutoff points, until I find the pair of lines that gives the best overall fits.
A different idea, which seems more involved, is to describe these data points as a combination of the two lines, y= m1*t + m2*t + b1 + b2
, where the m
s are the slopes and b
s are the y-intercepts. Then, taking the derivative of the total curve, I would have dy/dt = m1+m2
. Then perhaps I could cycle through different "cutouff points", and fit to lines until I get a combination where the derivative is closest to m1+m2
. But I'm not sure how to do this, since I'm not working with one curve initially, just a bunch of discrete data points.
What would be the best way to go about this optimization of two fits in Python?
One way to approximate our linear function is to sketch the line that seems to best fit the data. Then we can extend the line until we can verify the y-intercept. We can approximate the slope of the line by extending it until we can estimate the riserun .
Nonlinear regression is a mathematical function that uses a generated line – typically a curve – to fit an equation to some data. The sum of squares is used to determine the fitness of a regression model, which is computed by calculating the difference between the mean and every point of data.
The equation of a line of best fit can be represented as y=mx+b y = m x + b , where m is the slope and b is the y-intercept. We will take a look at two examples show a scatter plot with a line of best fit to understand the concept of how to approximate the equation of a line of best fit and make predictions.
To determine the line of best fit, we use the method of least squares to find coefficients a and b that minimize the sum of squared differences between each observation's actual y-value and its predicted y-value according to the linear model y = a + b*x .
This can be interpreted as a time-series segmentation problem in combination with linear regression. There are multiple approaches to tackle this problem. One of these you already mentioned: a manually selected point where to segment the data, another one is trying to minimize the error.
First I try to recreate the data:
import numpy as np; import matplotlib.pyplot as plt
y1 = np.linspace(5.5, 3.7, num=100)
y1 = y1 + np.random.rand(y1.shape[0]) * np.linspace(0, .3, num=y1.shape[0])
y2 = np.linspace(3.7, 1.1, num=500)
y2 = y2 + np.random.rand(y2.shape[0]) * np.linspace(0.3, 1.9, num=y2.shape[0])
y = np.append(y1, y2)
x = np.array(range(len(y)))
Then I do two linear fits using numpy.linalg.lstsq
which itself is based on the method of least squares:
x1 = x[:100]
y1 = y[:100]
A1 = np.vstack([x1, np.ones(len(x1))]).T
m1, c1 = np.linalg.lstsq(A1, y1, rcond=None)[0]
x2 = x[100:]
y2 = y[100:]
A2 = np.vstack([x2, np.ones(len(x2))]).T
m2, c2 = np.linalg.lstsq(A2, y2, rcond=None)[0]
Plotting this results in the following image:
plt.scatter(x, y)
plt.plot(x1, m1 * x1 + c1, 'r')
plt.plot(x2, m2 * x2 + c2, 'r')
plt.show()
You now could use an automatic segmentation algorithm like SWAB to replace the [100:]
and [:100]
slices, but I'd advise against it, if you are able to manually decide, at which point to split the data. Have a look at the link provided, if you're looking for an implementation.
Here is an example of fitting two straight lines to one data set, with the crossover point between the two lines also fitted as a parameter. This example uses scipy's Differential Evolution (DE) genetic algorithm to determine initial parameter estimates. The scipy implementation of DE uses the Latin Hypercube algorithm to ensure a thorough search of parameter space, and this algorithm requires bounds within which to search - in this example those bounds are taken from the data maximum and minimum values.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.optimize import differential_evolution
import warnings
xData = numpy.array([19.1647, 18.0189, 16.9550, 15.7683, 14.7044, 13.6269, 12.6040, 11.4309, 10.2987, 9.23465, 8.18440, 7.89789, 7.62498, 7.36571, 7.01106, 6.71094, 6.46548, 6.27436, 6.16543, 6.05569, 5.91904, 5.78247, 5.53661, 4.85425, 4.29468, 3.74888, 3.16206, 2.58882, 1.93371, 1.52426, 1.14211, 0.719035, 0.377708, 0.0226971, -0.223181, -0.537231, -0.878491, -1.27484, -1.45266, -1.57583, -1.61717])
yData = numpy.array([0.644557, 0.641059, 0.637555, 0.634059, 0.634135, 0.631825, 0.631899, 0.627209, 0.622516, 0.617818, 0.616103, 0.613736, 0.610175, 0.606613, 0.605445, 0.603676, 0.604887, 0.600127, 0.604909, 0.588207, 0.581056, 0.576292, 0.566761, 0.555472, 0.545367, 0.538842, 0.529336, 0.518635, 0.506747, 0.499018, 0.491885, 0.484754, 0.475230, 0.464514, 0.454387, 0.444861, 0.437128, 0.415076, 0.401363, 0.390034, 0.378698])
def func(xArray, breakpoint, slopeA, offsetA, slopeB, offsetB):
returnArray = []
for x in xArray:
if x < breakpoint:
returnArray.append(slopeA * x + offsetA)
else:
returnArray.append(slopeB * x + offsetB)
return returnArray
# function for genetic algorithm to minimize (sum of squared error)
def sumOfSquaredError(parameterTuple):
warnings.filterwarnings("ignore") # do not print warnings by genetic algorithm
val = func(xData, *parameterTuple)
return numpy.sum((yData - val) ** 2.0)
def generate_Initial_Parameters():
# min and max used for bounds
maxX = max(xData)
minX = min(xData)
maxY = max(yData)
minY = min(yData)
slope = 10.0 * (maxY - minY) / (maxX - minX) # times 10 for safety margin
parameterBounds = []
parameterBounds.append([minX, maxX]) # search bounds for breakpoint
parameterBounds.append([-slope, slope]) # search bounds for slopeA
parameterBounds.append([minY, maxY]) # search bounds for offsetA
parameterBounds.append([-slope, slope]) # search bounds for slopeB
parameterBounds.append([minY, maxY]) # search bounds for offsetB
result = differential_evolution(sumOfSquaredError, parameterBounds, seed=3)
return result.x
# by default, differential_evolution completes by calling curve_fit() using parameter bounds
geneticParameters = generate_Initial_Parameters()
# call curve_fit without passing bounds from genetic algorithm
fittedParameters, pcov = curve_fit(func, xData, yData, geneticParameters)
print('Parameters:', fittedParameters)
print()
modelPredictions = func(xData, *fittedParameters)
absError = modelPredictions - yData
SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))
print()
print('RMSE:', RMSE)
print('R-squared:', Rsquared)
print()
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
axes = f.add_subplot(111)
# first the raw data as a scatter plot
axes.plot(xData, yData, 'D')
# create data for the fitted equation plot
xModel = numpy.linspace(min(xData), max(xData))
yModel = func(xModel, *fittedParameters)
# now the model as a line plot
axes.plot(xModel, yModel)
axes.set_xlabel('X Data') # X axis data label
axes.set_ylabel('Y Data') # Y axis data label
plt.show()
plt.close('all') # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With