Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Bizarre behaviour of lm() and predict.lm() depending on use of explicit namespace accessor

Tags:

r

spline

lm

predict

I am interested in some disturbing behaviour of the lm function and the associated predict.lm function in R. The splines base package provides the function bs to generate b-spline expansions, which can then be used to fit a spline model using lm, a versatile linear model fitting function.

The lm and predict.lm functions have a lot of built-in convenience that take advantage of formulas and terms. If the call to bs() is nested inside the lm call, then the user can provide univariate data to predict, and this data will automatically be expanded into the appropriate b-spline basis. This expanded matrix of data will then predicted upon as usual.

library(splines)

x <- sort(runif(50, 0, 10))
y <- x^2

splineModel <- lm(y ~ bs(x, y, degree = 3, knots = c(3, 6)))

newData <- data.frame(x = 4)
prediction <- predict(splineModel, newData) # 16

plot(x, y)
lines(x, splineModel$fitted.values, col = 'blue3')
points(newData$x, prediction, pch = 3, cex = 3, col = 'red3')
legend("topleft", legend = c("Data", "Fitted Values", "Predicted Value"),
       pch = c(1, NA, 3), col = c('black', 'blue3', 'red3'), lty = c(NA, 1, NA))

As we see, this works perfectly:

enter image description here

The strangeness happens when one uses the :: operator to explicitly indicate that the bs function is exported from the namespace of the splines package. The following code snippet is identical except for that change:

library(splines)

x <- sort(runif(50, 0, 10))
y <- x^2

splineModel <- lm(y ~ splines::bs(x, y, degree = 3, knots = c(3, 6)))

newData <- data.frame(x = 4) 
prediction <- predict(splineModel, newData) # 6.40171

plot(x, y)
lines(x, splineModel$fitted.values, col = 'blue3')
points(newData$x, prediction, pch = 3, cex = 3, col = 'red3')
legend("topleft", legend = c("Data", "Fitted Values", "Predicted Value"),
       pch = c(1, NA, 3), col = c('black', 'blue3', 'red3'), lty = c(NA, 1, NA))

enter image description here

The exact same results are produced in the second snippet if the splines package is never attached using library in the first place. I cannot think of another situation in which the use of the :: operator on an already-loaded package changes program behaviour.

The same behaviour arises using other functions from splines like the natural spline basis implementation ns. Interestingly, in both cases the "y hat" or fitted values are reasonable and match one another. The fitted model objects are identical except for names of attributes, as far as I can tell.

I have been unable to pin down the source of this behaviour. While this may read like a bug report, my questions are

  1. Why does this happen? I have been trying to follow through predict.lm but cannot pin down where the divergence occurs.
  2. Is this somehow intended behaviour, and if so where can I learn more about it?
like image 893
mb7744 Avatar asked Apr 19 '17 20:04

mb7744


2 Answers

So the problem is that the model needs to keep track of the knots that were calculated with the original data and use those values when predicting new data. That typically happens in the model.frame() call inside the lm() call. The bs() function returns a class of "bs" and when making the model.frame, that column is dispatched to splines:::makepredictcall.bs to try to capture the boundary knots. (You can see the makepredictcall calls in the model.frame.default function.)

But if we compare the results

splineModel1 <- lm(y ~ bs(x, y, degree = 3, knots = c(3, 6)))
attr(terms(splineModel1), "predvar")
# list(y, bs(x, degree = 3L, knots = c(3, 6), Boundary.knots =  c(0.275912734214216, 
# 9.14309860439971), intercept = FALSE))

splineModel2 <- lm(y ~ splines::bs(x, y, degree = 3, knots = c(3, 6)))
attr(terms(splineModel2), "predvar")
# list(y, splines::bs(x, y, degree = 3, knots = c(3, 6)))

Notice how the second one doesn't capture the Boundary.knots. This is because of the splines:::makepredictcall.bs function which actually looks at the name of the call

function (var, call) {
    if (as.character(call)[1L] != "bs") 
        return(call)
    ...
}

When you use splines::bs in the formula, then as.character(call)[1L] returns "splines::bs" which does not match "bs" so nothing happens. It's unclear to me why this check is there. Seems like the method dispatching should be sufficient to assume it's a bs object.

In my opinion this does not seem like desired behavior and probably should be fixed. But the function bs() should not really be called without loading the package because functions like makepredictcall.bs don't be imported either so the custom dispatching for those objects would be broken.

like image 175
MrFlick Avatar answered Oct 24 '22 08:10

MrFlick


It seems to be related to the boundary knot values in the 'predvars' attribute of the 'terms' part of splineModel.

If we call them splineModel_1 and splineModel_2

predict(splineModel_1, newData)
16
predict(splineModel_2, newData)
6.969746

attr(splineModel_2[["terms"]], "predvars") <- attr(splineModel_1[["terms"]], "predvars")

predict(splineModel_1, newData)
16
predict(splineModel_2, newData)
16

attr(splineModel_1[["terms"]], "predvars")
list(y, bs(x, degree = 3L, knots = c(3, 6), Boundary.knots = c(0.323248628992587, 9.84225275926292), intercept = FALSE))

attr(splineModel_2[["terms"]], "predvars")
list(y, splines::bs(x, y, degree = 3, knots = c(3, 6)))

As you can see the difference is in the Boundary.knots. The only other difference is that the intercept defaults to FALSE so that's probably not relevant. The Boundary.knots are taken from the min and max of x. As for it being set by one version of bs and not another, I can only assume this is a relic in the code of lm that looks for 'bs' and not 'splines::bs' to set the Boundary.knots correctly.

like image 40
Gladwell Avatar answered Oct 24 '22 08:10

Gladwell