Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fitting a curve "around" data points in R

Tags:

optimization

r

i have a data set consisting of a collection of points. the points are distributed on the plane in such a way that they can be roughly bounded by a parabola. i am trying to find a way of fitting a parabola to the boundary of the points.

this is what i have at present:

a = 1
b = 2
c = 3

parabola <- function(x) {
    a * x^2 + b * x + c
}

N = 10000

x <- runif(N, -4, 3)
y <- runif(N, 0, 10)

data <- data.frame(x, y)

data <- subset(data, y >= parabola(x))

plot(data, xlim = c(-5, 5), ylim = c(0, 10), col = "grey")

fr <- function(x) {
    PAR = x[1] * data$x^2 + x[2] * data$x + x[3]
    #
    sum((PAR - data$y)^2 + 100 * plogis(PAR - data$y, scale = 0.00001))
}

par = optim(c(0, 0, 0), fr)$par

a = par[1]
b = par[2]
c = par[3]

curve(parabola, add = TRUE, lty = "dashed")

this creates a sample data set and then fits a curve to the boundary. the objective function consists of a "normal" squared error term, which fits a parabola to the data as well as a second logistic term which penalizes points which live below the parabola. the parameters (100 and 0.00001) of this second term were determined by trial and error.

the code plots the points as well as the fitted parabola.

now this system works... but only some of the time. sometimes it produces a fit that is completely wrong and i guess that in these instances the parameters for the logistic term are just inappropriate. run the code a few times to see what i mean.

i am sure that there must be a more robust way of solving this problem. ideas and suggestions?

.

like image 893
datawookie Avatar asked Nov 23 '12 06:11

datawookie


1 Answers

I can not provide a complete answer. The only ad hoc idea I had was to provide better starting points for the optimization algorithm - hoping you are closer to the local minimum of the function you try to optimize.

Estimating a rough first version is fairly simple. If you write your parabola as b*(x-a)^2+c you can estimate

a <- data$x[which.min(data$y)]
c <- min(data$y)
 
b1 <- (data$y[which.min(data$x)] - c) / (min(data$x) - a)^2
b2 <- (data$y[which.max(data$x)] - c) / (max(data$x) - a)^2
b <- mean(c(b1, b2))

EDIT

I had another intensive testing session with my suggestion and the method "BFGS". I could not find a counter example with the following approach:

seed <- floor(runif(1,1,1000))
set.seed(seed)
a = 1
b = 2
c = 3

parabola <- function(x) {
    b * (x-a)^2 + c
}

N = 10000

x <- runif(N, -4, 3)
y <- runif(N, 0, 10)

data <- data.frame(x, y)

data <- subset(data, y >= parabola(x))

plot(data, xlim = c(-5, 5), ylim = c(0, 10), col = "grey")

fr <- function(x) {
    PAR = x[2] * (data$x - x[1])^2 + x[3]
    #
    sum((PAR - data$y)^2 + 100 * plogis(PAR - data$y, scale = 0.00001))
}

a <- data$x[which.min(data$y)]
c <- min(data$y)

b1 <- (data$y[which.min(data$x)] - c) / (min(data$x) - a)^2
b2 <- (data$y[which.max(data$x)] - c) / (max(data$x) - a)^2
b <- mean(c(b1, b2))

par = optim(c(a, b, c), fr, method="BFGS")$par

a = par[1]
b = par[2]
c = par[3]

curve(parabola, add = TRUE, lty = "dashed")

However, correct convergence is not guaranteed. I tried about 50 cases and all went well. Is your result reviewed or does it have to work correctly on an automated basis?


EDIT 2

I had a few thoughts on how you could update your objective function to be more reliable. Right now I do not have the time to work out a complete solution, but maybe this thoughts may help you:

We have date within range(data$x). Now we want to find a parabola that fits the lower boundary of this data as good as possible - or, in other words, find values a,b,c that maximize

\int_{\range(x)} ax^2 + bx+c dx

(Please excuse the clumsy LaTeX - writing formulas sometimes just is better).

Now, penalizing points below the parabola could be done using a penalty function like

\lambda (ax_i^2+bx_i+c - y_i)^2 if below parabola, 0 otherwise

Subtracting that function from the interval should give you a suitable, smooth objective function. Simplifying the function as good a possible seems to be a better model than using the least squares approach, which tries to fit a line through the middle of the datapoints.

You will still have to chosse a suitable lambda, though. But that is typical: You need a compromise between two different objectives (fitting data, maximizing the parabola). The weight which one is more important has to be submitted by you.

like image 138
Thilo Avatar answered Oct 09 '22 07:10

Thilo