R's predict function can take a newdata parameter, and its documentation reads:
newdata An optional data frame in which to look for variables with which to predict. If omitted, the fitted values are used.
But I found that this does not always hold; it depends on how the model was fit. For instance, the following code works as expected:
x <- rnorm(200, sd=10)
y <- x + rnorm(200, sd=1)
data <- data.frame(x, y)
train <- sample(seq_along(x), size=length(x)/2, replace=FALSE)
dataTrain <- data[train,]
dataTest <- data[-train,]
m <- lm(y ~ x, data=dataTrain)
head(predict(m,type="response"))
head(predict(m,newdata=dataTest,type="response"))
But if the model is fit like this:
m2 <- lm(dataTrain$y ~ dataTrain$x)
head(predict(m2,type="response"))
head(predict(m2,newdata=dataTest,type="response"))
The last two lines produce exactly the same result. The predict function ignores the newdata parameter, i.e. it does not compute predictions on the new data at all.
The culprit, of course, is lm(y ~ x, data=dataTrain) versus lm(dataTrain$y ~ dataTrain$x). But I couldn't find any documentation that mentions the difference between the two. Is this a known issue?
I'm using R 2.15.2.
See ?predict.lm and the Note section, which I quote below:
Note:
Variables are first looked for in ‘newdata’ and then searched for
in the usual way (which will include the environment of the
formula used in the fit). A warning will be given if the
variables found are not of the same length as those in ‘newdata’
if it was supplied.
Whilst it doesn't state the behaviour in terms of "same name" etc., as far as the formula is concerned the terms you passed to it were of the form foo$var. There is no variable with a name like that in newdata, so R falls back to evaluating foo$var in the environment of the formula, which returns the data the model was fitted on.
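To make the naming point concrete, here is a minimal sketch that prints the term labels stored in each fitted model and compares them with the column names in newdata. It reuses the object names from the question; the set.seed call and the labels_m / labels_m2 variables are additions for illustration only.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
x <- rnorm(200, sd = 10)
y <- x + rnorm(200, sd = 1)
data <- data.frame(x, y)
train <- sample(seq_along(x), size = length(x) / 2)
dataTrain <- data[train, ]
dataTest <- data[-train, ]

m  <- lm(y ~ x, data = dataTrain)
m2 <- lm(dataTrain$y ~ dataTrain$x)

# The predictor terms each model will look for at prediction time
labels_m  <- attr(terms(m),  "term.labels")
labels_m2 <- attr(terms(m2), "term.labels")
print(labels_m)    # "x"
print(labels_m2)   # "dataTrain$x"

# The columns actually available in newdata
print(names(dataTest))  # "x" "y"
```

The first model asks for x, which dataTest provides. The second asks for dataTrain$x, which no column of dataTest can satisfy.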
In your second case, you are misusing the model formula notation; the idea is to describe the model succinctly and symbolically. Succinctness and repeating the data object ad nauseam are not compatible.
The behaviour you note is exactly consistent with the documented behaviour. In simple terms, you fitted the model with terms dataTrain$x and dataTrain$y, then supplied newdata containing variables named x and y. As far as R is concerned those are different names and hence different things, and it was right not to match them.
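As a rough check of the above, the sketch below compares the two fits from the question. It assumes the same simulated data; set.seed and the helper variables (p_fit, p_new, p_m_new, manual) are illustrative names that do not appear in the original post.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
x <- rnorm(200, sd = 10)
y <- x + rnorm(200, sd = 1)
data <- data.frame(x, y)
train <- sample(seq_along(x), size = length(x) / 2)
dataTrain <- data[train, ]
dataTest <- data[-train, ]

m  <- lm(y ~ x, data = dataTrain)     # formula refers to plain names
m2 <- lm(dataTrain$y ~ dataTrain$x)   # formula hard-codes the data object

# m2: newdata has no variable called dataTrain$x, so the training
# values are found again and the fitted values come back
p_fit <- predict(m2)
p_new <- predict(m2, newdata = dataTest)
same_as_fitted <- isTRUE(all.equal(unname(p_fit), unname(p_new)))

# m: x is found in newdata, so the predictions are for the test rows
p_m_new <- predict(m, newdata = dataTest)
manual  <- coef(m)[1] + coef(m)[2] * dataTest$x
uses_newdata <- isTRUE(all.equal(unname(p_m_new), unname(manual)))

print(same_as_fitted)
print(uses_newdata)
```

Because the training and test sets here have the same number of rows, the length warning mentioned in the Note is not triggered, which makes the problem easy to miss.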