I've built an lm model without using the data= parameter:
m1 <- lm( mdldvlp.trim$y ~ gc.pc$scores[,1] + gc.pc$scores[,2] + gc.pc$scores[,3] +
gc.pc$scores[,4] + gc.pc$scores[,5] + gc.pc$scores[,6] + predict(gc.tA))
Now I'd like to predict from m1 using newdata, so I need to name the columns of my new data.frame to match the variables used in the lm() call above.
With newComps as my new gc.pc (which, like the gc.tA prediction, was computed from the new data.frame without any issues), I've tried each of the following:
newD <- data.frame( newComps[1:100,1:6] ,
predict(gc.tA , newdata = mdldvlp[1:100,predKept]))
names(newD) <- names(m1$coefficients)[-1]
names(newD) <- names(m1$model)[-1]
names(newD) <- c( "gc.pc$scores[, 1]" , "gc.pc$scores[, 2]" , "gc.pc$scores[, 3]" ,
"gc.pc$scores[, 4]" , "gc.pc$scores[, 5]" , "gc.pc$scores[, 6]" ,
"predict(gc.tA)" )
names(newD) <- c( "gc.pc$scores[,1]" , "gc.pc$scores[,2]" , "gc.pc$scores[,3]" ,
"gc.pc$scores[,4]" , "gc.pc$scores[,5]" , "gc.pc$scores[,6]" ,
"predict(gc.tA)" )
Unfortunately, predict.lm does not accept any of the naming strategies above. It returns the dreaded newdata warning along with the predictions from the original data.frame that built m1:
Warning message:
'newdata' had 100 rows but variable(s) found have 1414 rows
How should I name the newD columns to make the predict call work? Thanks.
The code below recreates the issue:
require(rpart)
set.seed(123)
X <- matrix(runif(200) , 20 , 10)
gc.pc <- princomp(X)
y <- runif(20)
mdldvlp.trim <- data.frame(y,X)
names(mdldvlp.trim) <- c("y",paste("x",1:10,sep=""))
predKept <- paste("x",1:10,sep="")
gc.tA <- rpart( y ~ . , data = mdldvlp.trim)
m1 <- lm( mdldvlp.trim$y ~ gc.pc$scores[,1] + gc.pc$scores[,2] + gc.pc$scores[,3] +
gc.pc$scores[,4] + gc.pc$scores[,5] + gc.pc$scores[,6] + predict(gc.tA))
mdldvlp <- data.frame(matrix(runif(2000) , 200 , 10))
names(mdldvlp) <- predKept
newComps <- predict( gc.pc , newdata=mdldvlp )
newD <- data.frame( newComps[1:100,1:6] ,
predict(gc.tA , newdata = mdldvlp[1:100,predKept]))
# enter newD naming strategy here
predict( m1 , newdata=newD )
Thanks all for your answers. I understand that things would be easier if I first created a data.frame with properly named predictors. My question is: if the modeling data frame does indeed evaluate to a data frame with variables named gc.pc$scores[,1] etc., then why won't the naming 'strategies' used above work with predict.lm? In other words, does lm really build its modeling data frame with columns named gc.pc$scores[,1] and so on? If it did, wouldn't the renaming strategies above work in predict.lm?
You are abusing the formula notation and it is this that is causing you problems. Essentially your formula:
m1 <- lm( mdldvlp.trim$y ~ gc.pc$scores[,1] + gc.pc$scores[,2] +
gc.pc$scores[,3] + gc.pc$scores[,4] +
gc.pc$scores[,5] + gc.pc$scores[,6] +
predict(gc.tA))
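produces a model frame whose columns are labelled gc.pc$scores[, 1] etc. Those labels are deparsed expressions, though, not plain variable names. When you call predict(), the expressions are re-evaluated against the object passed to the newdata argument, so R looks there for objects named gc.pc and gc.tA rather than for columns carrying those labels. Finding none, it falls back to the originals in your workspace, which is why you get the warning and the predictions for the original data.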
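To see why renaming the columns of newdata has no effect, here is a minimal sketch on a cut-down, two-component version of the model. The data are simulated and the names vars_needed, newD and p are illustrative only; the point is just to compare the labels stored in the model frame with the variables the formula actually refers to.

```r
set.seed(123)
X <- matrix(runif(200), 20, 10)
gc.pc <- princomp(X)
y <- runif(20)

# Same style of formula as in the question, without a data= argument
m1 <- lm(y ~ gc.pc$scores[, 1] + gc.pc$scores[, 2])

# The model frame columns are labelled with the deparsed expressions
names(m1$model)

# But the formula refers to the object gc.pc, not to a column with that label
vars_needed <- all.vars(formula(m1))
vars_needed

# A data frame whose column *names* copy those labels is therefore ignored:
# gc.pc is not found in newD, so the original 20-row object is used instead
newD <- data.frame(matrix(runif(10), 5, 2))
names(newD) <- names(m1$model)[-1]
p <- suppressWarnings(predict(m1, newdata = newD))
length(p)  # 20 (the training size), not 5
```

The warning is suppressed here only so the sketch runs quietly; without suppressWarnings() you would see the same "'newdata' had ... rows" message as in the question.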
Ideally, you'd create a data object containing all the variables you want in the model, with appropriate names, e.g.:
fitData <- data.frame(mdldvlp.trim$y, gc.pc$scores[, 1:6], predict(gc.tA))
names(fitData) <- c("trimY", paste("scores", 1:6, sep = ""), "preds")
and then fit the model via:
m1 <- lm(trimY ~ ., data = fitData)
New predictions can be made from the model by providing a data frame with the same names as used to fit the model. Hence, using your newD:
newD <- data.frame(newComps[1:100,1:6] ,
predict(gc.tA , newdata = mdldvlp[1:100,predKept]))
names(newD) <- c(paste("scores", 1:6, sep = ""), "preds")
and then call predict():
predict(m1 , newdata=newD)
The complete example:
require(rpart)
set.seed(123)
X <- matrix(runif(200) , 20 , 10)
gc.pc <- princomp(X)
y <- runif(20)
mdldvlp.trim <- data.frame(y,X)
names(mdldvlp.trim) <- c("y",paste("x",1:10,sep=""))
predKept <- paste("x",1:10,sep="")
gc.tA <- rpart( y ~ . , data = mdldvlp.trim)
fitData <- data.frame(mdldvlp.trim$y, gc.pc$scores[, 1:6], predict(gc.tA))
names(fitData) <- c("trimY", paste("scores", 1:6, sep = ""), "preds")
m1 <- lm(trimY ~ ., data = fitData)
mdldvlp <- data.frame(matrix(runif(2000) , 200 , 10))
names(mdldvlp) <- predKept
newComps <- predict( gc.pc , newdata=mdldvlp )
newD <- data.frame(newComps[1:100,1:6] ,
predict(gc.tA , newdata = mdldvlp[1:100,predKept]))
names(newD) <- c(paste("scores", 1:6, sep = ""), "preds")
predict(m1 , newdata=newD)
I've had a similar issue in the past. I think I resolved it by giving my variables names instead of referring to them by column number. For example, don't use gc.pc[,1]; instead, convert the gc.pc matrix to a data frame and name the columns ("PC1", "PC2", etc.). Then make sure that your newdata is also a data frame that uses these same names.
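A minimal sketch of that approach, assuming simulated data and the hypothetical names scores, fitData, newX and newScores (none of which come from the original question):

```r
set.seed(123)
X <- matrix(runif(200), 20, 10)
colnames(X) <- paste0("x", 1:10)
y <- runif(20)
gc.pc <- princomp(X)

# Turn the first six component scores into a data frame with plain names
scores <- as.data.frame(gc.pc$scores[, 1:6])
names(scores) <- paste0("PC", 1:6)
fitData <- data.frame(y = y, scores)

# Fit with data= so the model only refers to columns of fitData
fit <- lm(y ~ ., data = fitData)

# New observations: project onto the components, then reuse the same names
newX <- matrix(runif(50), 5, 10)
colnames(newX) <- colnames(X)
newScores <- as.data.frame(predict(gc.pc, newdata = newX)[, 1:6])
names(newScores) <- paste0("PC", 1:6)

p <- predict(fit, newdata = newScores)
length(p)  # one prediction per row of newScores
```

Because the model was fitted with data=, predict() finds PC1 to PC6 in newScores and returns one value per new row, with no row-count warning.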