I am interested in some disturbing behaviour of the lm
function and the associated predict.lm
function in R. The splines
base package provides the function bs
to generate b-spline expansions, which can then be used to fit a spline model using lm
, a versatile linear model fitting function.
The lm
and predict.lm
functions have a lot of built-in convenience that take advantage of formulas and terms. If the call to bs()
is nested inside the lm
call, then the user can provide univariate data to predict
, and this data will automatically be expanded into the appropriate b-spline basis. This expanded matrix of data will then predicted upon as usual.
library(splines)
x <- sort(runif(50, 0, 10))
y <- x^2
splineModel <- lm(y ~ bs(x, y, degree = 3, knots = c(3, 6)))
newData <- data.frame(x = 4)
prediction <- predict(splineModel, newData) # 16
plot(x, y)
lines(x, splineModel$fitted.values, col = 'blue3')
points(newData$x, prediction, pch = 3, cex = 3, col = 'red3')
legend("topleft", legend = c("Data", "Fitted Values", "Predicted Value"),
pch = c(1, NA, 3), col = c('black', 'blue3', 'red3'), lty = c(NA, 1, NA))
As we see, this works perfectly:
The strangeness happens when one uses the ::
operator to explicitly indicate that the bs
function is exported from the namespace of the splines
package. The following code snippet is identical except for that change:
library(splines)
x <- sort(runif(50, 0, 10))
y <- x^2
splineModel <- lm(y ~ splines::bs(x, y, degree = 3, knots = c(3, 6)))
newData <- data.frame(x = 4)
prediction <- predict(splineModel, newData) # 6.40171
plot(x, y)
lines(x, splineModel$fitted.values, col = 'blue3')
points(newData$x, prediction, pch = 3, cex = 3, col = 'red3')
legend("topleft", legend = c("Data", "Fitted Values", "Predicted Value"),
pch = c(1, NA, 3), col = c('black', 'blue3', 'red3'), lty = c(NA, 1, NA))
The exact same results are produced in the second snippet if the splines
package is never attached using library
in the first place. I cannot think of another situation in which the use of the ::
operator on an already-loaded package changes program behaviour.
The same behaviour arises using other functions from splines
like the natural spline basis implementation ns
. Interestingly, in both cases the "y hat" or fitted values are reasonable and match one another. The fitted model objects are identical except for names of attributes, as far as I can tell.
I have been unable to pin down the source of this behaviour. While this may read like a bug report, my questions are
predict.lm
but cannot pin down where the divergence occurs.So the problem is that the model needs to keep track of the knots that were calculated with the original data and use those values when predicting new data. That typically happens in the model.frame()
call inside the lm()
call. The bs()
function returns a class of "bs"
and when making the model.frame, that column is dispatched to splines:::makepredictcall.bs
to try to capture the boundary knots. (You can see the makepredictcall
calls in the model.frame.default
function.)
But if we compare the results
splineModel1 <- lm(y ~ bs(x, y, degree = 3, knots = c(3, 6)))
attr(terms(splineModel1), "predvar")
# list(y, bs(x, degree = 3L, knots = c(3, 6), Boundary.knots = c(0.275912734214216,
# 9.14309860439971), intercept = FALSE))
splineModel2 <- lm(y ~ splines::bs(x, y, degree = 3, knots = c(3, 6)))
attr(terms(splineModel2), "predvar")
# list(y, splines::bs(x, y, degree = 3, knots = c(3, 6)))
Notice how the second one doesn't capture the Boundary.knots
. This is because of the splines:::makepredictcall.bs
function which actually looks at the name of the call
function (var, call) {
if (as.character(call)[1L] != "bs")
return(call)
...
}
When you use splines::bs
in the formula, then as.character(call)[1L]
returns "splines::bs"
which does not match "bs"
so nothing happens. It's unclear to me why this check is there. Seems like the method dispatching should be sufficient to assume it's a bs
object.
In my opinion this does not seem like desired behavior and probably should be fixed. But the function bs()
should not really be called without loading the package because functions like makepredictcall.bs
don't be imported either so the custom dispatching for those objects would be broken.
It seems to be related to the boundary knot values in the 'predvars' attribute of the 'terms' part of splineModel.
If we call them splineModel_1 and splineModel_2
predict(splineModel_1, newData)
16
predict(splineModel_2, newData)
6.969746
attr(splineModel_2[["terms"]], "predvars") <- attr(splineModel_1[["terms"]], "predvars")
predict(splineModel_1, newData)
16
predict(splineModel_2, newData)
16
attr(splineModel_1[["terms"]], "predvars")
list(y, bs(x, degree = 3L, knots = c(3, 6), Boundary.knots = c(0.323248628992587, 9.84225275926292), intercept = FALSE))
attr(splineModel_2[["terms"]], "predvars")
list(y, splines::bs(x, y, degree = 3, knots = c(3, 6)))
As you can see the difference is in the Boundary.knots. The only other difference is that the intercept defaults to FALSE so that's probably not relevant. The Boundary.knots are taken from the min and max of x. As for it being set by one version of bs and not another, I can only assume this is a relic in the code of lm that looks for 'bs' and not 'splines::bs' to set the Boundary.knots correctly.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With