Give an algorithm that takes a sequence of points in the plane (x_1, y_1), (x_2, y_2), ...., (x_n, y_n) and an integer k as input and returns the best piecewise linear function f consisting of at most k pieces that minimizes the sum squared error. You may assume that you have access to an algorithm that computes the sum squared error for one segment through a set of n points in Θ(n) time.The solution should use O(n^2k) time and O(nk) space.
Can anyone help me with this problem? Thank you so much!
(This is too late for your homework, but hope it helps anyway.)
First is dynamic programming in python / numpy for k = 4
only,
to help you understand how dynamic programming works;
once you understand that, writing a loop for any k should be easy.
Also, Cost[]
is a 2d matrix, space O(n^2);
see the notes at the end for getting down to space O(n k)
#!/usr/bin/env python
""" split4.py: min-cost split into 4 pieces, dynamic programming k=4 """
from __future__ import division
import numpy as np
__version__ = "2014-03-09 mar denis"
#...............................................................................
def split4( Cost, verbose=1 ):
""" split4.py: min-cost split into 4 pieces, dynamic programming k=4
min Cost[0:a] + Cost[a:b] + Cost[b:c] + Cost[c:n]
Cost[a,b] = error in least-squares line fit to xy[a] .. xy[b] *including b*
or error in lsq horizontal lines, sum (y_j - av y) ^2 for each piece --
o--
o-
o---
o----
| | | |
0 2 5 9
(Why 4 ? to walk through step by step, then put in a loop)
"""
# speedup: maxlen 2 n/k or so
Cost = np.asanyarray(Cost)
n = Cost.shape[1]
# C2 C3 ... costs, J2 J3 ... indices of best splits
J2 = - np.ones(n, dtype=int) # -1, NaN mark undefined / bug
C2 = np.ones(n) * np.NaN
J3 = - np.ones(n, dtype=int)
C3 = np.ones(n) * np.NaN
# best 2-splits of the left 2 3 4 ...
for nleft in range( 1, n ):
J2[nleft] = j = np.argmin([ Cost[0,j-1] + Cost[j,nleft] for j in range( 1, nleft+1 )]) + 1
C2[nleft] = Cost[0,j-1] + Cost[j,nleft]
# an idiom for argmin j, min value c together
# best 3-splits of the left 3 4 5 ...
for nleft in range( 2, n ):
J3[nleft] = j = np.argmin([ C2[j-1] + Cost[j,nleft] for j in range( 2, nleft+1 )]) + 2
C3[nleft] = C2[j-1] + Cost[j,nleft]
# best 4-split of all n --
j4 = np.argmin([ C3[j-1] + Cost[j,n-1] for j in range( 3, n )]) + 3
c4 = C3[j4-1] + Cost[j4,n-1]
j3 = J3[j4]
j2 = J2[j3]
jsplit = np.array([ 0, j2, j3, j4, n ])
if verbose:
print "split4: len %s pos %s cost %.3g" % (np.diff(jsplit), jsplit, c4)
print "split4: J2 %s C2 %s" %(J2, C2)
print "split4: J3 %s C3 %s" %(J3, C3)
return jsplit
#...............................................................................
if __name__ == "__main__":
import random
import sys
import spread
n = 10
ncycle = 2
plot = 0
seed = 0
# run this.py a=1 b=None c=[3] 'd = expr' ... in sh or ipython
for arg in sys.argv[1:]:
exec( arg )
np.set_printoptions( 1, threshold=100, edgeitems=10, linewidth=100, suppress=True )
np.random.seed(seed)
random.seed(seed)
print "\n", 80 * "-"
title = "Dynamic programming least-square horizontal lines %s n %d seed %d" % (
__file__, n, seed)
print title
x = np.arange( n + 0. )
y = np.sin( 2*np.pi * x * ncycle / n )
# synthetic time series ?
print "y: %s av %.3g variance %.3g" % (y, y.mean(), np.var(y))
print "Cost[j,k] = sum (y - av y)^2 --" # len * var y[j:k+1]
Cost = spread.spreads_allij( y )
print Cost # .round().astype(int)
jsplit = split4( Cost )
# split4: len [3 2 3 2] pos [ 0 3 5 8 10]
if plot:
import matplotlib.pyplot as pl
title += "\n lengths: %s" % np.diff(jsplit)
pl.title( title )
pl.plot( y )
for js, js1 in zip( jsplit[:-1], jsplit[1:] ):
if js1 <= js: continue
yav = y[js:js1].mean() * np.ones( js1 - js + 1 )
pl.plot( np.arange( js, js1 + 1 ), yav )
# pl.legend()
pl.show()
Then, the following code does Cost[]
for horizontal lines only, slope 0;
extending it to line segments of any slope, in time O(n), is left as an exercise.
""" spreads( all y[:j] ) in time O(n)
define spread( y[] ) = sum (y - average y)^2
e.g. spread of 24 hourly temperatures y[0:24] i.e. y[0] .. y[23]
around a horizontal line at the average temperature
(spread = 0 for constant temperature,
24 c^2 for constant + [c -c c -c ...],
24 * variance(y) )
How fast can one compute all 24 spreads
1 hour (midnight to 1 am), 2 hours ... all 24 ?
A simpler problem: compute all 24 averages in time O(n):
N = np.arange( 1, len(y)+1 )
allav = np.cumsum(y) / N
= [ y0, (y0 + y1) / 2, (y0 + y1 + y2) / 3 ...]
An identity:
spread(y) = sum(y^2) - n * (av y)^2
Voila: the code below, all spreads() in time O(n).
Exercise: extend this to spreads around least-squares lines
fit to [ y0, [y0 y1], [y0 y1 y2] ... ], not just horizontal lines.
"""
from __future__ import division
import sys
import numpy as np
#...............................................................................
def spreads( y ):
""" [ spread y[:1], spread y[:2] ... spread y ] in time O(n)
where spread( y[] ) = sum (y - average y )^2
= n * variance(y)
"""
N = np.arange( 1, len(y)+1 )
return np.cumsum( y**2 ) - np.cumsum( y )**2 / N
def spreads_allij( y ):
""" -> A[i,j] = sum (y - av y)^2, spread of y around its average
for all y[i:j+1]
time, space O(n^2)
"""
y = np.asanyarray( y, dtype=float )
n = len(y)
A = np.zeros((n,n))
for i in range(n):
A[i,i:] = spreads( y[i:] )
return A
So far we have an n x n cost matrix, space O(n^2).
To get down to space O( n k ),
look closely at the pattern of Cost[i,j]
accesses in the dyn-prog code:
for nleft .. to n:
Cost_nleft = Cost[j,nleft ] -- time nleft or nleft^2
for k in 3 4 5 ...:
min [ C[k-1, j-1] + Cost_nleft[j] for j .. to nleft ]
Here Cost_nleft
is one row of the full n x n cost matrix, ~ n segments, generated as needed.
This can be done in time O(n) for line segments.
But if "error for one segment through a set of n points takes O(n) time",
it seems we're up to time O(n^3). Comments anyone ?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With