
R smooth.spline(): smoothing spline is not smooth but overfitting my data

I have several data points which seem suitable for fitting a spline through them. When I do this, I get a rather bumpy fit, like overfitting, which is not what I understand as smoothing.

[plot: the bumpy, overfitted spline through the data points]

Is there a special option or parameter for getting back a really smooth spline function?

Using the penalty parameter of smooth.spline had no visible effect. Maybe I did it wrong?

Here are data and code:

results <- structure(
    list(
        beta = c(
            0.983790622281964, 0.645152464354322,
            0.924104713597375, 0.657703886566088, 0.788138034115623, 0.801080207252363,
            1, 0.858337365965949, 0.999687052533693, 0.666552625121279, 0.717453633245958,
            0.621570152961453, 0.964658181346544, 0.65071758770312, 0.788971505000918,
            0.980476054183113, 0.670263506919246, 0.600387040967624, 0.759173403408052,
            1, 0.986409675965, 0.982996471134736, 1, 0.995340781899163, 0.999855895958986,
            1, 0.846179233381267, 0.879226324448832, 0.795820998892035, 0.997586607285667,
            0.848036806290156, 0.905320944437968, 0.947709125535428, 0.592172373022407,
            0.826847031044922, 0.996916006944244, 0.785967729206612, 0.650346929853076,
            0.84206351833549, 0.999043126652724, 0.936879214753098, 0.76674066557003,
            0.591431233516217, 1, 0.999833445117791, 0.999606223666537, 0.6224971799303,
            1, 0.974537160571494, 0.966717133936379
        ), inventoryCost = c(
            1750702.95138889,
            442784.114583333, 1114717.44791667, 472669.357638889, 716895.920138889,
            735396.180555556, 3837320.74652778, 872873.4375, 2872414.93055556,
            481095.138888889, 538125.520833333, 392199.045138889, 1469500.95486111,
            459873.784722222, 656220.486111111, 1654143.83680556, 437511.458333333,
            393295.659722222, 630952.170138889, 4920958.85416667, 1723517.10069444,
            1633579.86111111, 4639909.89583333, 2167748.35069444, 3062420.65972222,
            5132702.34375, 838441.145833333, 937659.288194444, 697767.1875,
            2523016.31944444, 800903.819444444, 1054991.49305556, 1266970.92013889,
            369537.673611111, 764995.399305556, 2322879.6875, 656021.701388889,
            458403.038194444, 844133.420138889, 2430700, 1232256.68402778,
            695574.479166667, 351348.524305556, 3827440.71180556, 3687610.41666667,
            2950652.51736111, 404550.78125, 4749901.64930556, 1510481.59722222,
            1422708.07291667
        )
    ), .Names = c("beta", "inventoryCost"), class = c("data.frame")
)

plot(results$beta,results$inventoryCost)
mySpline <- smooth.spline(results$beta,results$inventoryCost, penalty=999999)
lines(mySpline$x, mySpline$y, col="red", lwd = 2)
Asked by mondano on May 30 '16.


2 Answers

Transform your data sensibly before modelling

Based on the scale of your results$inventoryCost, log transform is appropriate. For simplicity, in the following I am using x, y. I am also reordering your data so that x is ascending:

x <- results$beta; y <- log(results$inventoryCost)
reorder <- order(x); x <- x[reorder]; y <- y[reorder]

par(mfrow = c(1,2))
plot(x, y, main = "take log transform")
hist(x, main = "x is skewed")

[plots: y on the log scale (left); histogram showing skewed x (right)]

The left figure looks much better. It is also highly recommended to transform x as well, because it is skewed (see the right figure).

The following transform is appropriate:

x1 <- -(1-x)^(1/3)

The cube root of (1 - x) spreads the data out around x = 1. I added a minus sign so that the relation between x and x1 is positively monotonic rather than negative. Now let's check the relationship:

par(mfrow = c(1,2))
plot(x1, y, main = expression(y %~% ~ x1))
hist(x1, main = "x1 is well spread out")

[plots: y against x1 (left); histogram showing x1 well spread out (right)]

Fitting a spline

Now we are ready for statistical modelling. Try the following call:

fit <- smooth.spline(x1, y, nknots = 10)
pred <- stats:::predict.smooth.spline(fit, x1)$y  ## predict at all x1
## or you can simply call: pred <- predict(fit, x1)$y
plot(x1, y)  ## scatter plot
lines(x1, pred, lwd = 2, col = 2)  ## fitted spline

[plot: smooth spline fit on the transformed scale]

Does it look nice? Note that nknots = 10 tells smooth.spline to place 10 interior knots (by quantile); we are therefore fitting a penalized regression spline rather than a smoothing spline. In fact, smooth.spline() almost never fits a smoothing spline, unless you set all.knots = TRUE (see the later example).

I also dropped penalty = 999999, as that has nothing to do with smoothness control. If you really want to control smoothness yourself, rather than letting smooth.spline figure out the optimal amount by GCV, use the df or spar argument. I will give an example later.
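For instance, a minimal sketch of explicit smoothness control (the synthetic data below stands in for the transformed x1 and y above; the shapes and the df value are illustrative assumptions, not part of the original answer):

```r
## illustrative sketch: fix the smoothness yourself instead of using GCV
set.seed(1)
x1 <- sort(runif(50, -1, 0))                 # stand-in for the transformed predictor
y  <- 14 + 2 * x1 + rnorm(50, sd = 0.2)      # stand-in response on the log scale
fit_df   <- smooth.spline(x1, y, df = 6)     # request ~6 equivalent degrees of freedom
fit_spar <- smooth.spline(x1, y, spar = 0.8) # or fix the smoothing parameter directly
fit_df$df                                    # close to the requested 6
```

Fixing df is usually easier to reason about than spar, since it maps directly to model complexity.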

To transform fit back to original scale, do:

plot(x, exp(y), main = expression(Inventory %~%~ beta))
lines(x, exp(pred), lwd = 2, col = 2)

As you can see, the fitted spline is as smooth as you had expected.

[plot: the fitted spline back-transformed to the original scale, smooth as expected]

Explanation on fitted spline

Let's see the summary of your fitted spline:

> fit

Smoothing Parameter  spar= 0.4549062  lambda= 0.0008657722 (11 iterations)
Equivalent Degrees of Freedom (Df): 6.022959
Penalized Criterion: 0.08517417
GCV: 0.004288539

We used 10 knots and ended up with about 6 equivalent degrees of freedom, so the penalization suppresses roughly 4 parameters. The smoothing parameter chosen by GCV, after 11 iterations, is lambda = 0.0008657722.
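These quantities can also be read off the fitted object directly. A small sketch on synthetic data (the field names are standard components of a smooth.spline fit; the data here are just for illustration):

```r
## inspect the components of a smooth.spline fit (synthetic data for illustration)
set.seed(42)
sfit <- smooth.spline(1:50, rnorm(50), nknots = 10)
sfit$df      # equivalent degrees of freedom selected by GCV
sfit$lambda  # the smoothing parameter itself
sfit$spar    # spar, a rescaled log(lambda)
```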


Why do we have to transform x to x1?

A spline is penalized by its 2nd derivative, but the penalization acts on the averaged/integrated 2nd derivative over all data points. Now look at your data (x, y). Before x = 0.98 the relationship is relatively steady; as x approaches 1, it quickly becomes steeper. The "change point" near 0.98 has a very high second derivative, much higher than the second derivatives at other locations.

y0 <- as.numeric(tapply(y, x, mean))  ## remove tied values
x0 <- unique(x)  ## remove tied values
dy0 <- diff(y0)/diff(x0)  ## 1st order difference
ddy0 <- diff(dy0)/diff(x0[-1])  ## 2nd order difference
plot(x0[1:43], abs(ddy0), pch = 19)

[plot: absolute 2nd order differences, with a huge spike near x = 0.98]

Look at that huge spike in the 2nd order difference/derivative! Now, if we fit a spline directly, the spline curve around this change point will be heavily penalized.

bad <- smooth.spline(x, y, all.knots = TRUE)
bad.pred <- predict(bad, x)$y
plot(x, exp(y), main = expression(Inventory %~% ~ beta))
lines(x, exp(bad.pred), col = 2, lwd = 3)
abline(v = 0.98, lwd = 2, lty = 2)

[plot: direct spline fit failing to follow the data beyond x = 0.98]

You can see clearly that the spline is having some difficulty in approximating data after x = 0.98.

There are of course ways to achieve a better approximation after this change point, for example by manually setting a smaller smoothing parameter or a higher degree of freedom. But that goes to the other extreme. Remember, both the penalty and the degrees of freedom are global measures. Increasing model complexity gives a better approximation after x = 0.98, but also makes all other parts more bumpy. Now let's try a model with 45 degrees of freedom:

worse <- smooth.spline(x, y, all.knots = TRUE, df = 45)
worse.pred <- predict(worse, x)$y
plot(x, exp(y), main = expression(Inventory %~% ~ beta))
lines(x, exp(worse.pred), col = 2, lwd = 2)

[plot: overly wiggly spline fit with df = 45]

As you can see, the curve is bumpy. With 45 degrees of freedom we have clearly overfitted our dataset of 50 observations.

In fact, your original misuse of smooth.spline() is doing the same thing:

> mySpline
Call:
smooth.spline(x = results$beta, y = results$inventoryCost, penalty = 999999)

Smoothing Parameter  spar= -0.8074624  lambda= 3.266077e-19 (17 iterations)
Equivalent Degrees of Freedom (Df): 45
Penalized Criterion: 5.598386
GCV: 0.03824885

Oops, 45 degrees of freedom: overfitting!

Answered by Zheyuan Li on Oct 13 '22.



I don't think you should use (or want) splinefun here. I would suggest fitting a GAM instead:

library(mgcv)
fit <- gam(inventoryCost ~ s(beta, bs = "cr", k = 20), data = results)
summary(fit)
gam.check(fit)
plot(fit)

plot(inventoryCost ~ beta, data = results, col = "dark red", pch = 16)
curve(predict(fit, newdata = data.frame(beta = x)), add = TRUE, 
      from = min(results$beta), to = max(results$beta), n = 1e3, lwd = 2)
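If gam.check suggests the basis dimension is off, one way to compare candidate values of k is via AIC. A hedged sketch on synthetic data (the k values are assumptions; with the question's data you would pass data = results):

```r
library(mgcv)
## sketch: comparing basis dimensions k for the smooth term
set.seed(1)
d <- data.frame(beta = runif(100, 0.6, 1))
d$inventoryCost <- exp(8 * d$beta) + rnorm(100, sd = 100)
fit10 <- gam(inventoryCost ~ s(beta, bs = "cr", k = 10), data = d)
fit20 <- gam(inventoryCost ~ s(beta, bs = "cr", k = 20), data = d)
AIC(fit10, fit20)  # lower AIC indicates the better fit/complexity trade-off
```

Because the penalty shrinks the effective degrees of freedom, a generous k mostly costs computation, not overfitting.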

[plot: GAM fit curve over the scatter of inventoryCost against beta]

Answered by Roland on Oct 13 '22.