
predict.lm with newdata

Tags:

r

I've built an lm model without using the data= parameter:

m1 <- lm( mdldvlp.trim$y ~  gc.pc$scores[,1] + gc.pc$scores[,2] + gc.pc$scores[,3] + 
                            gc.pc$scores[,4] + gc.pc$scores[,5] + gc.pc$scores[,6] + predict(gc.tA))

Now I'd like to predict from m1 using newdata, so I have tried naming the columns of my new data.frame to match the variables used in the lm() call above.

With newComps as my new gc.pc scores (which, like the gc.tA predictions, were computed from the new data.frame without any issues), I've tried

newD <- data.frame( newComps[1:100,1:6] ,
                    predict(gc.tA , newdata = mdldvlp[1:100,predKept]))


names(newD) <- names(m1$coefficients)[-1]
names(newD) <- names(m1$model)[-1]

names(newD) <- c( "gc.pc$scores[, 1]" , "gc.pc$scores[, 2]" , "gc.pc$scores[, 3]" , 
                  "gc.pc$scores[, 4]" , "gc.pc$scores[, 5]" , "gc.pc$scores[, 6]" , 
                  "predict(gc.tA)" )
names(newD) <- c( "gc.pc$scores[,1]" , "gc.pc$scores[,2]" , "gc.pc$scores[,3]" , 
                  "gc.pc$scores[,4]" , "gc.pc$scores[,5]" , "gc.pc$scores[,6]" , 
                  "predict(gc.tA)" )

Unfortunately, predict.lm does not accept the naming strategies above and returns the dreaded newdata warning along with the predictions from the original data.frame that built m1:

Warning message:
'newdata' had 100 rows but variable(s) found have 1414 rows  

How should I name the newD columns to make the predict call work? Thanks.

The code below recreates the issue:

    require(rpart)

    set.seed(123)
    X <- matrix(runif(200) , 20 , 10)
    gc.pc <- princomp(X)
    y <- runif(20)
    mdldvlp.trim <- data.frame(y,X)
    names(mdldvlp.trim) <- c("y",paste("x",1:10,sep=""))
    predKept <- paste("x",1:10,sep="")

    gc.tA <- rpart( y ~ . , data = mdldvlp.trim)

    m1 <- lm( mdldvlp.trim$y ~  gc.pc$scores[,1] + gc.pc$scores[,2] + gc.pc$scores[,3] + 
                                gc.pc$scores[,4] + gc.pc$scores[,5] + gc.pc$scores[,6] + predict(gc.tA))

    mdldvlp <- data.frame(matrix(runif(2000) , 200 , 10))
    names(mdldvlp) <- predKept

    newComps <- predict( gc.pc , newdata=mdldvlp )

    newD <- data.frame( newComps[1:100,1:6] ,
                        predict(gc.tA , newdata = mdldvlp[1:100,predKept]))

    # enter newD naming strategy here

    predict( m1 , newdata=newD )

4/20 Follow up:

Thanks all for your answers. I understand that things would be easier if I first created a data.frame with properly named predictors. My question is: if the formula does indeed evaluate to a data frame with variables named gc.pc$scores[,1] etc., then why don't the naming 'strategies' used above work with predict.lm? In other words, does lm really build its modeling data frame with variables named gc.pc$scores[,1] and so on? If it did, wouldn't the renaming strategies above work in predict.lm?

M.Dimo asked Apr 20 '12 03:04


2 Answers

You are abusing the formula notation and it is this that is causing you problems. Essentially your formula:

m1 <- lm( mdldvlp.trim$y ~  gc.pc$scores[,1] + gc.pc$scores[,2] + 
                            gc.pc$scores[,3] + gc.pc$scores[,4] + 
                            gc.pc$scores[,5] + gc.pc$scores[,6] + 
                            predict(gc.tA))

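will evaluate to a model frame whose columns are labelled gc.pc$scores[, 1] etc. Those labels are only the deparsed text of each term, though. When you use predict(), each term is re-evaluated as an R expression, with the object passed to the newdata argument as the first place to look. R therefore searches newdata for an object called gc.pc (not for a column whose name is the whole string), does not find one, and falls back to the gc.pc in your workspace. That is why no renaming of the columns of newD can work, and why you get the predictions for the original data back along with the warning.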
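A minimal sketch of what happens at predict time, using toy objects (pc, fit and newD here are illustrative names, not objects from the question). The model-frame labels look like column names, but predict() re-evaluates the underlying expressions, so columns carrying those labels are never consulted:

```r
set.seed(1)
pc  <- list(scores = matrix(rnorm(40), 20, 2))  # toy stand-in for gc.pc
y   <- rnorm(20)
fit <- lm(y ~ pc$scores[, 1] + pc$scores[, 2])

# Labels in the model frame are just the deparsed expressions
labs <- names(model.frame(fit))[-1]

# Give newD columns with exactly those labels
newD <- data.frame(matrix(rnorm(10), 5, 2))
names(newD) <- labs

# predict() looks for an object called `pc` in newD, finds none,
# and falls back to the copy used for fitting, which has 20 rows
p <- suppressWarnings(predict(fit, newdata = newD))
length(p)                                  # 20, not 5
all.equal(unname(p), unname(fitted(fit)))  # TRUE
```

Without suppressWarnings() the call should emit the same kind of "'newdata' had 5 rows but variables found have 20 rows" warning as in the question.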

Ideally, you'd create a data frame containing all the variables you want in the model, with appropriate names, e.g.:

fitData <- data.frame(mdldvlp.trim$y, gc.pc$scores[, 1:6], predict(gc.tA))
names(fitData) <- c("trimY", paste("scores", 1:6, sep = ""), "preds")

and then fit the model via:

m1 <- lm(trimY ~ ., data = fitData)

New predictions can be made from the model by providing a data frame with the same names as used to fit the model. Hence using your newD:

newD <- data.frame(newComps[1:100,1:6] ,
                   predict(gc.tA , newdata = mdldvlp[1:100,predKept]))
names(newD) <- c(paste("scores", 1:6, sep = ""), "preds")

and then call predict():

predict(m1 , newdata=newD)
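Before calling predict() it can help to check that newdata really supplies every predictor the model needs. The sketch below is one way to do that, on made-up data; the column names mirror the ones above, but the values are random and the object names (m, needed, missingCols) are only for illustration:

```r
set.seed(42)
fitData <- data.frame(trimY   = runif(20),
                      scores1 = rnorm(20),
                      scores2 = rnorm(20),
                      preds   = runif(20))
m <- lm(trimY ~ ., data = fitData)

# Predictor names the fitted model expects (response removed)
needed <- all.vars(delete.response(terms(m)))

newD <- data.frame(scores1 = rnorm(5),
                   scores2 = rnorm(5),
                   preds   = runif(5))

# Any predictor missing from newD would otherwise be looked up elsewhere
missingCols <- setdiff(needed, names(newD))
if (length(missingCols) > 0)
  stop("newdata is missing: ", paste(missingCols, collapse = ", "))

p <- predict(m, newdata = newD)
length(p)  # 5, one prediction per row of newD
```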

Full example

require(rpart)

set.seed(123)
X <- matrix(runif(200) , 20 , 10)
gc.pc <- princomp(X)
y <- runif(20)
mdldvlp.trim <- data.frame(y,X)
names(mdldvlp.trim) <- c("y",paste("x",1:10,sep=""))
predKept <- paste("x",1:10,sep="")

gc.tA <- rpart( y ~ . , data = mdldvlp.trim)
fitData <- data.frame(mdldvlp.trim$y, gc.pc$scores[, 1:6], predict(gc.tA))
names(fitData) <- c("trimY", paste("scores", 1:6, sep = ""), "preds")
m1 <- lm(trimY ~ ., data = fitData)
mdldvlp <- data.frame(matrix(runif(2000) , 200 , 10))
names(mdldvlp) <- predKept

newComps <- predict( gc.pc , newdata=mdldvlp )
newD <- data.frame(newComps[1:100,1:6] ,
                   predict(gc.tA , newdata = mdldvlp[1:100,predKept]))
names(newD) <- c(paste("scores", 1:6, sep = ""), "preds")
predict(m1 , newdata=newD)

Gavin Simpson answered Oct 04 '22 22:10


I've had a similar issue in the past. I think I resolved it by giving my variables names instead of referring to them by column number, e.g. don't use gc.pc$scores[,1]; instead convert the gc.pc$scores matrix to a data frame and name the columns ("PC1", "PC2", etc.). Then make sure that your newdata is also a data frame that uses these names.
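Something along these lines, as a rough sketch on random data (pc, train, Xnew and the other object names are made up for the example; only princomp(), lm() and predict() are the real functions):

```r
set.seed(123)
X <- matrix(runif(200), 20, 10)
colnames(X) <- paste0("x", 1:10)
pc <- princomp(X)

# Training frame: response plus the first three component scores, named PC1..PC3
train <- data.frame(y = runif(20), pc$scores[, 1:3])
names(train)[-1] <- paste0("PC", 1:3)
fit <- lm(y ~ PC1 + PC2 + PC3, data = train)

# New observations: project onto the same components, then reuse the same names
Xnew <- matrix(runif(50), 5, 10)
colnames(Xnew) <- colnames(X)
newScores <- predict(pc, newdata = Xnew)
newD <- data.frame(newScores[, 1:3])
names(newD) <- paste0("PC", 1:3)

p <- predict(fit, newdata = newD)
length(p)  # 5
```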

Marc in the box answered Oct 04 '22 21:10