The problem is to find the best fit of a real-valued 2D curve (given as a set of points) by a polyline consisting of two line segments.
A brute-force approach would be to compute the "left" and "right" linear fits for each candidate split point of the curve and pick the pair with the minimum total error. I can calculate the two linear fits incrementally while iterating through the points of the curve, but I can't find a way to incrementally calculate the error, so this approach has quadratic complexity.
The first question: is there an algorithm with sub-quadratic complexity?
The second question: is there a handy C++ library for such algorithms?
EDIT: For fitting with a single line, there are closed-form formulas:
m = (Σxᵢyᵢ - Σxᵢ·Σyᵢ/N) / (Σxᵢ² - (Σxᵢ)²/N)
b = Σyᵢ/N - m·Σxᵢ/N
where m is the slope and b is the offset of the line.
Having a similar closed-form formula for the fit error would be the ideal solution.
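For illustration, here is a minimal Python/numpy sketch of these formulas (the function name and test values are my own), checked against numpy.polyfit:
import numpy as np

def line_fit(x, y):
    # Least-squares fit y ≈ m*x + b from the sums in the formulas above.
    N = len(x)
    sum_x, sum_y = x.sum(), y.sum()
    sum_x2, sum_xy = (x * x).sum(), (x * y).sum()
    m = (sum_xy - sum_x * sum_y / N) / (sum_x2 - sum_x * sum_x / N)
    b = sum_y / N - m * sum_x / N
    return m, b

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.5, 2.4, 4.6, 6.5])
m, b = line_fit(x, y)
assert np.allclose([m, b], np.polyfit(x, y, 1))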
Disclaimer: I don't feel like figuring out how to do this in C++, so I will use Python (numpy) notation. The concepts are completely transferable, so you should have no trouble translating back to the language of your choice.
Let's say that you have a pair of arrays, x and y, containing the data points, and that x is monotonically increasing. Let's also say that you will always select a partition point that leaves at least two elements in each partition, so the equations are solvable.
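For concreteness, a hypothetical data set satisfying these assumptions could be built like this (the breakpoint at x = 5 and the noise level are made up for illustration):
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 100)                  # monotonically increasing
y = np.where(x < 5.0, 2.0 * x + 1.0,             # left segment: slope 2, intercept 1
             -1.0 * x + 16.0)                    # right segment: slope -1, intercept 16
y = y + rng.normal(scale=0.1, size=x.shape)      # add a little noise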
Now you can compute some relevant quantities:
# Running sums for the left partition (initially just the first point)
N = len(x)
sum_x_left = x[0]
sum_x2_left = x[0] * x[0]
sum_y_left = y[0]
sum_y2_left = y[0] * y[0]
sum_xy_left = x[0] * y[0]
# Running sums for the right partition (all remaining points)
sum_x_right = x[1:].sum()
sum_x2_right = (x[1:] * x[1:]).sum()
sum_y_right = y[1:].sum()
sum_y2_right = (y[1:] * y[1:]).sum()
sum_xy_right = (x[1:] * y[1:]).sum()
The reason that we need these quantities (which are O(N) to initialize) is that you can use them directly to compute some well-known formulae for the parameters of a linear regression. For example, the optimal m and b for y = m * x + b are given by
μx = Σxᵢ/N
μy = Σyᵢ/N
m = Σ(xᵢ - μx)(yᵢ - μy) / Σ(xᵢ - μx)²
b = μy - m·μx
The sum of squared errors is given by
e = Σ(yᵢ - m·xᵢ - b)²
These can be expanded using simple algebra into the following:
m = (Σxᵢyᵢ - Σxᵢ·Σyᵢ/N) / (Σxᵢ² - (Σxᵢ)²/N)
b = Σyᵢ/N - m·Σxᵢ/N
e = Σyᵢ² + m²·Σxᵢ² + N·b² - 2·m·Σxᵢyᵢ - 2·b·Σyᵢ + 2·m·b·Σxᵢ
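As a quick sanity check (mine, not part of the original derivation), the expanded expression for e matches the directly computed residual sum on a small example:
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.1, 2.9])
N = len(x)
sum_x, sum_y = x.sum(), y.sum()
sum_x2, sum_y2 = (x * x).sum(), (y * y).sum()
sum_xy = (x * y).sum()
m = (sum_xy - sum_x * sum_y / N) / (sum_x2 - sum_x * sum_x / N)
b = sum_y / N - m * sum_x / N
e_sums = (sum_y2 + m * m * sum_x2 + N * b * b
          - 2 * m * sum_xy - 2 * b * sum_y + 2 * m * b * sum_x)
e_direct = ((y - m * x - b) ** 2).sum()
assert abs(e_sums - e_direct) < 1e-9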
You can therefore loop over all the possibilities and record the minimal e:
for p in range(1, N - 2):  # leave at least two points in each partition
    # shift sums: O(1)
    sum_x_left += x[p]
    sum_x2_left += x[p] * x[p]
    sum_y_left += y[p]
    sum_y2_left += y[p] * y[p]
    sum_xy_left += x[p] * y[p]
    sum_x_right -= x[p]
    sum_x2_right -= x[p] * x[p]
    sum_y_right -= y[p]
    sum_y2_right -= y[p] * y[p]
    sum_xy_right -= x[p] * y[p]
    # compute err: O(1)
    n_left = p + 1
    slope_left = (sum_xy_left - sum_x_left * sum_y_left / n_left) / (sum_x2_left - sum_x_left * sum_x_left / n_left)
    intercept_left = sum_y_left / n_left - slope_left * sum_x_left / n_left
    err_left = sum_y2_left + slope_left * slope_left * sum_x2_left + n_left * intercept_left * intercept_left - 2 * (slope_left * sum_xy_left + intercept_left * sum_y_left - slope_left * intercept_left * sum_x_left)
    n_right = N - n_left
    slope_right = (sum_xy_right - sum_x_right * sum_y_right / n_right) / (sum_x2_right - sum_x_right * sum_x_right / n_right)
    intercept_right = sum_y_right / n_right - slope_right * sum_x_right / n_right
    err_right = sum_y2_right + slope_right * slope_right * sum_x2_right + n_right * intercept_right * intercept_right - 2 * (slope_right * sum_xy_right + intercept_right * sum_y_right - slope_right * intercept_right * sum_x_right)
    err = err_left + err_right
    if p == 1 or err < err_min:
        err_min = err
        n_min_left = n_left
        n_min_right = n_right
        slope_min_left = slope_left
        slope_min_right = slope_right
        intercept_min_left = intercept_left
        intercept_min_right = intercept_right
There are probably other simplifications you can make, but this is sufficient to have an O(N) algorithm.
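To recover the fitted polyline afterwards, something along these lines should work (continuing from the loop above, using the quantities it recorded):
# The left segment covers x[0] .. x[n_min_left - 1], the right segment the rest.
split = n_min_left - 1
y_fit_left = slope_min_left * x[:split + 1] + intercept_min_left
y_fit_right = slope_min_right * x[split + 1:] + intercept_min_right
print(f"split after x = {x[split]}, total squared error = {err_min}")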