
Feeding newdata to R predict function

Tags: r

R's predict function can take a newdata parameter, and its documentation reads:

newdata An optional data frame in which to look for variables with which to predict. If omitted, the fitted values are used.

But I found that this is not entirely true, depending on how the model is fit. For instance, the following code works as expected:

x <- rnorm(200, sd=10)
y <- x + rnorm(200, sd=1)
data <- data.frame(x, y)
train <- sample(seq_along(x), size = length(x) / 2, replace = FALSE)
dataTrain <- data[train,]
dataTest <- data[-train,]
m <- lm(y ~ x, data=dataTrain)
head(predict(m,type="response"))
head(predict(m,newdata=dataTest,type="response"))

But if the model is fit like this:

m2 <- lm(dataTrain$y ~ dataTrain$x)
head(predict(m2,type="response"))
head(predict(m2,newdata=dataTest,type="response"))

The last two lines produce exactly the same result. The predict function appears to ignore the newdata parameter, i.e. it does not compute predictions on the new data at all.

The culprit, of course, is lm(y ~ x, data=dataTrain) versus lm(dataTrain$y ~ dataTrain$x). But I couldn't find any documentation that mentions the difference between the two. Is this a known issue?

I'm using R 2.15.2.

edwardw asked Feb 27 '13 15:02


1 Answer

See ?predict.lm and the Note section, which I quote below:

Note:

     Variables are first looked for in ‘newdata’ and then searched for
     in the usual way (which will include the environment of the
     formula used in the fit).  A warning will be given if the
     variables found are not of the same length as those in ‘newdata’
     if it was supplied.

Whilst it doesn't state the behaviour in terms of "same name" etc, as far as the formula is concerned the terms you passed in to it were of the form foo$var and there are no such variables with names like that either in newdata or along the search path that R will traverse to look for them.
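To make the naming point concrete, here is a minimal sketch that inspects the term labels stored in each fitted model. It reuses the question's simulated data; the seed and the helper names labels_m and labels_m2 are assumptions added for illustration only.

```r
# Simulated data as in the question; the seed is an assumption for reproducibility.
set.seed(1)
x <- rnorm(200, sd = 10)
y <- x + rnorm(200, sd = 1)
data <- data.frame(x, y)
train <- sample(seq_along(x), size = length(x) / 2)
dataTrain <- data[train, ]

m  <- lm(y ~ x, data = dataTrain)        # formula refers to a column name
m2 <- lm(dataTrain$y ~ dataTrain$x)      # formula refers to an expression

# The term labels are what predict() tries to resolve at prediction time.
labels_m  <- attr(terms(m),  "term.labels")
labels_m2 <- attr(terms(m2), "term.labels")
print(labels_m)
print(labels_m2)
```

The first model stores the plain name x, which can be found as a column of newdata. The second stores the whole expression dataTrain$x, which can only be evaluated against the dataTrain object in the workspace.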

In your second case, you are totally misusing the model formula notation; the idea is to succinctly and symbolically describe the model. Succinctness and repeating the data object ad nauseam are not compatible.

The behaviour you note is exactly consistent with the documented behaviour. In simple terms, you fitted the model with terms dataTrain$x and dataTrain$y, then tried to predict for terms x and y. As far as R is concerned those are different names and hence different things, and it was right not to match them.
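As a rough check of this explanation, the sketch below compares the two fits on the question's simulated data. The seed and the names p_fit, p_new2, p_new, same_as_fitted and uses_test_rows are assumptions introduced here, not part of the original code.

```r
# Simulated data as in the question; the seed is an assumption for reproducibility.
set.seed(1)
x <- rnorm(200, sd = 10)
y <- x + rnorm(200, sd = 1)
data <- data.frame(x, y)
train <- sample(seq_along(x), size = length(x) / 2)
dataTrain <- data[train, ]
dataTest  <- data[-train, ]

m  <- lm(y ~ x, data = dataTrain)
m2 <- lm(dataTrain$y ~ dataTrain$x)

p_fit  <- predict(m2)                      # fitted values on the training rows
p_new2 <- predict(m2, newdata = dataTest)  # dataTrain$x is still found in the workspace
p_new  <- predict(m,  newdata = dataTest)  # x is found in dataTest

# m2 returns the training fitted values even though newdata was supplied.
same_as_fitted <- isTRUE(all.equal(unname(p_fit), unname(p_new2)))

# m really uses the test rows: predictions match intercept + slope * dataTest$x.
by_hand <- coef(m)[[1]] + coef(m)[[2]] * dataTest$x
uses_test_rows <- isTRUE(all.equal(unname(p_new), by_hand))

print(c(same_as_fitted = same_as_fitted, uses_test_rows = uses_test_rows))
```

Because the training and test sets here have the same number of rows, the length warning mentioned in the Note is not triggered, which is why the problem goes unnoticed. Fitting with lm(y ~ x, data=dataTrain) keeps the formula in terms of column names, so newdata can supply them.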

Gavin Simpson answered Oct 27 '22 10:10