I've been trying to implement this algorithm in Python for a few days now. I keep coming back to it, giving up, and getting frustrated. I don't know what's going on. I don't have anyone to ask or anywhere to go for help, so I've come here.
PDF Warning: http://www.cs.uiuc.edu/class/sp08/cs473/Lectures/lec10.pdf
I don't think it's clearly explained; I certainly don't understand it.
My understanding of what's happening is this:
We have a set of points (x1,y1), (x2,y2), ... and we want to find some lines that best fit this data. We can have multiple straight lines, and these lines come from the given formulas for a and b (y = ax + b).
Now the algorithm starts at the end (?) and assumes that a point p(x_i, y_i) is part of the last line segment. Then the notes say that the optimal solution 'is optimal solution for {p1, . . . , pi−1} plus (best) line through {pi, . . . , pn}'. To me that just means we go to the point p(x_i, y_i), go backwards, and find another line segment through the rest of the points. The optimal solution is then both of these line segments.
Then it takes a logical jump I can't follow, and says "Suppose the last point pn is part of a segment that starts at p_i. If Opt(j) denotes the cost of the first j points and e(j,k) the error of the best line through points j to k then Opt(n) = e(i, n) + C + Opt(i − 1)"
Then there is the pseudocode for the algorithm, which I don't understand. I understand that we want to iterate through the list of points and find the points which minimize the OPT(n) formula, but I just don't follow it. It's making me feel stupid.
I know this question is a pain in the ass and that it's not easy to answer but I'm just looking for some guidance towards understanding this algorithm. I apologize for the PDF but I don't have a neater way of getting the crucial information to the reader.
Thank you for your time and reading this, I appreciate it.
The least squares problem asks to find a single line of best fit for the given points and has a nice closed form solution. This solution is used as a primitive to solve the segmented least squares problem.
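For reference, that closed form is the standard least-squares one (this snippet is my own sketch, not taken from the slides; it assumes at least two distinct x values so the denominator is nonzero):

```python
def least_squares_fit(xs, ys):
    """Closed-form slope a and intercept b of the line y = ax + b
    minimizing the sum of squared vertical errors.
    Assumes len(xs) >= 2 with at least two distinct x values."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Points on y = 2x + 1 are recovered exactly:
a, b = least_squares_fit([0, 1, 2], [1, 3, 5])   # a -> 2.0, b -> 1.0
```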
In the segmented least squares problem, we can have any number of segments to fit the given points and we have to pay a cost for each new segment used. If the cost of using a new segment was 0, we could perfectly fit all points by passing a separate line through each point.
Now suppose we are trying to find the set of segments that is a best fit to the n given points. If we knew the best solutions for n-1 subproblems: best fit for first point, best fit for first 2 points, ..., best fit for first n-1 points, then we could compute the best solution for the original problem with n points as follows:
The nth point lies on a single segment and this segment starts at some earlier point i, for some i = 1, 2, ..., n. We have already solved all smaller subproblems i.e., having less than n points so we can find the best fit for n points that minimizes:
cost of best fit for first i-1 points (already computed) + cost of single line that is best fit for points i to n (using least squares problem) + cost of using a new segment
The value of i that minimizes the above quantity gives us the solution: use the best fit to subproblem i-1 and the segment through point i to n.
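That recurrence can be sketched in a few lines of Python (the names `error_fn` and `C` are my own: `error_fn(i, j)` stands in for the least-squares error of one line through points i..j, and `C` is the per-segment cost):

```python
def segmented_cost(n, error_fn, C):
    """OPT[j] = min over i in 1..j of OPT[i-1] + C + e(i, j),
    with OPT[0] = 0 and points numbered 1..n.
    error_fn(i, j) is assumed to return the least-squares error
    of fitting a single line through points i..j inclusive."""
    OPT = [0.0] * (n + 1)          # OPT[0] = 0: no points, no cost
    for j in range(1, n + 1):
        OPT[j] = min(OPT[i - 1] + C + error_fn(i, j)
                     for i in range(1, j + 1))
    return OPT[n]

# If every fit were free (error 0), the best solution is one segment
# costing exactly C:
print(segmented_cost(3, lambda i, j: 0.0, 5.0))   # -> 5.0
```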
If you need more help, I've written an explanation of the algorithm and provided C++ implementation here: http://kartikkukreja.wordpress.com/2013/10/21/segmented-least-squares-problem/
The tricky part, the dynamic programming part, is the section

```
for j = 1 to n
    M[j] = min_{1 <= i <= j} ( e(i, j) + C + M[i-1] )
```

where `M[0] = 0` is set earlier and n is the total number of data points. (The underscore means that the range in braces is a subscript of `min`.) The professor could just as well have used OPT instead of M, and this is done in some other universities' lectures on this algorithm, which you can find on the web (and which look almost identical). Now, if you compare my code block above with the one in the lecture, you will notice a difference: I wrote `M[i-1]` where your professor wrote `M[j-1]`. It's a typo in your professor's pseudocode. You can even look back at the slide before and see that he had it correct in the error function there.
Basically, for each j, you are finding the point i from which to draw a line to j, such that the error of that line, plus the cost of adding this extra segment (C), plus the cost of optimally covering all points up to i (which has already been computed), is minimized.
Also, remember that `e(i,i) = 0` and `e(i,i+1) = 0`, because fitting a line to a single point, or to just two points, gives no error.
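You can check those base cases directly with a small error helper (my own hypothetical function, built on the standard closed-form fit and assuming distinct x values): one point, or two points, are fit exactly, so the error is zero:

```python
def fit_error(points):
    """Sum of squared vertical errors of the least-squares line
    through the given (x, y) points; assumes distinct x values."""
    n = len(points)
    if n <= 1:
        return 0.0                 # e(i, i) = 0: one point, exact fit
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return sum((y - (a * x + b)) ** 2 for x, y in points)

print(fit_error([(1, 5)]))            # one point  -> 0.0
print(fit_error([(0, 0), (2, 7)]))    # two points -> 0.0 (line passes through both)
```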
Starting from point 1, the best solution up to a point j must include the new end-point j in the last line segment, so the problem becomes: where do I place the last split to minimize the cost of this last line segment?
Fortunately, the cost is calculated in terms of subproblems of the same problem you are trying to solve, and fortunately you've already solved these smaller subproblems by the time you move on to the next point j. So at the new point j you are trying to find an optimal point i, between points 1 and j, at which to split off a new line segment that includes j and minimizes the cost: optimal_cost_up_to(i) + cost_of_split + cost_of_lsq_fit(i+1, j).

Now, the confusing part is that at any point it might seem like you are just finding a single split, but in reality all the previous splits are determined by optimal_cost_up_to(i). Since you've already calculated the optimal cost for all the points leading up to j, you just need to memoize the answers so that the algorithm doesn't have to recalculate these costs each time it advances a point.
In Python I'd probably use a dictionary to store the results, although for this dynamic programming algorithm an array might be better... anyways:
```python
def optimalSolution(points, split_cost):
    # lsqFitCost(points, i, j) is assumed to return the least-squares
    # error of fitting a single line through points[i..j]
    solutions = {0: {'cost': 0, 'splits': []}}
    for j in range(1, len(points)):
        best_split = None
        best_cost = lsqFitCost(points, 0, j)   # no split: one segment 0..j
        for i in range(0, j):
            cost = solutions[i]['cost'] + split_cost + lsqFitCost(points, i + 1, j)
            if cost < best_cost:
                best_cost = cost
                best_split = i
        if best_split is not None:
            solutions[j] = {'cost': best_cost,
                            'splits': solutions[best_split]['splits'] + [best_split]}
        else:
            solutions[j] = {'cost': best_cost, 'splits': []}
    return solutions
```
it's not pretty, and I haven't checked it thoroughly (there might be an error in the case where no split gives the best cost), but hopefully it'll help get you on the right path. Note that lsqFitCost has to do a lot of work on each iteration, but for a least-squares line fit like this you can reduce that cost a lot by maintaining running sums used in the calculation... you should Google least squares line fitting for more info. This can make lsqFitCost constant-time, so the overall algorithm would be O(N^2).
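That running-sums idea can be sketched with prefix sums (my own naming, and it assumes distinct x values): precompute cumulative sums of x, y, x², xy, and y² once in O(N), after which each error query `e(i, j)` takes O(1):

```python
class SegmentErrors:
    """Precomputes prefix sums so the least-squares error e(i, j) of
    fitting one line through points[i..j] takes O(1) per query."""

    def __init__(self, points):
        n = len(points)
        # prefix[k] holds the sum over the first k points
        self.sx = [0.0] * (n + 1)
        self.sy = [0.0] * (n + 1)
        self.sxx = [0.0] * (n + 1)
        self.sxy = [0.0] * (n + 1)
        self.syy = [0.0] * (n + 1)
        for k, (x, y) in enumerate(points):
            self.sx[k + 1] = self.sx[k] + x
            self.sy[k + 1] = self.sy[k] + y
            self.sxx[k + 1] = self.sxx[k] + x * x
            self.sxy[k + 1] = self.sxy[k] + x * y
            self.syy[k + 1] = self.syy[k] + y * y

    def error(self, i, j):
        """Squared error of the best line through points i..j (0-based, inclusive)."""
        n = j - i + 1
        if n <= 2:
            return 0.0            # 1 or 2 points are always fit exactly
        sx = self.sx[j + 1] - self.sx[i]
        sy = self.sy[j + 1] - self.sy[i]
        sxx = self.sxx[j + 1] - self.sxx[i]
        sxy = self.sxy[j + 1] - self.sxy[i]
        syy = self.syy[j + 1] - self.syy[i]
        a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
        b = (sy - a * sx) / n
        # sum of (y - a*x - b)^2 expanded in terms of the aggregate sums
        return (syy - 2 * a * sxy - 2 * b * sy
                + a * a * sxx + 2 * a * b * sx + n * b * b)
```

Plugging something like `SegmentErrors(points).error` in as `lsqFitCost` is what brings the whole algorithm down to O(N^2).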