
predict.lm after regression with missing data in Y

I don't understand how to generate predicted values from a linear regression with predict.lm when some values of the dependent variable Y are missing, even though no observation of the independent variable X is missing. Algebraically this isn't a problem, but I don't know an efficient way to do it in R. Take this fake data frame and regression model, for example. I try to assign the predictions to a column of the source data frame, but because of one missing Y value I get an error.

# Create a fake dataframe
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(100,200,300,400,NA,600,700,800,900,100)
df <- as.data.frame(cbind(x,y))

# Regress X and Y
model<-lm(y~x+1)
summary(model)

# Attempt to generate predictions in source dataframe but am unable to.
df$y_ip<-predict.lm(model)

Error in `$<-.data.frame`(`*tmp*`, y_ip, value = c(221.............
  replacement has 9 rows, data has 10

I got around this problem by generating the predictions using algebra, df$y <- B0 + B1*df$x, or by pulling the coefficients out of the model, df$y <- summary(model)$coefficients[1, 1] + summary(model)$coefficients[2, 1]*df$x; however, I am now working with a big data model with hundreds of coefficients, and these methods are no longer practical. I'd like to know how to do it using the predict function.

Thank you in advance for your assistance!

Asked by Aron, Jan 30 '23 02:01


1 Answer

There is built-in functionality for this in R, though it is not necessarily obvious: the na.action argument together with the na.exclude function (see ?na.exclude). With na.action=na.exclude set, predict() (and similar downstream functions such as fitted() and residuals()) will automatically fill in NA values in the relevant spots.

Set up data:

df <- data.frame(x=1:10,y=100*(1:10))
df$y[5] <- NA

Fit model: default na.action is na.omit, which simply removes non-complete cases.

mod1 <- lm(y~x+1,data=df)
predict(mod1)
##    1    2    3    4    6    7    8    9   10 
##  100  200  300  400  600  700  800  900 1000 
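
This is where the length mismatch in the question comes from: the prediction vector has one element per row actually used in the fit, not one per row of the data frame. A quick check, as a sketch that re-creates df and mod1 from above so it runs on its own (p1 is just an illustrative name):

```r
# Same data and model as above
df <- data.frame(x = 1:10, y = 100 * (1:10))
df$y[5] <- NA
mod1 <- lm(y ~ x + 1, data = df)

p1 <- predict(mod1)
length(p1)   # number of rows used in the fit (row 5 was dropped)
nrow(df)     # number of rows in the data frame
names(p1)    # row names of the cases kept; "5" is absent

# df$y_ip <- p1   # would fail with "replacement has 9 rows, data has 10"
```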

na.exclude removes non-complete cases before fitting, but then restores them (filled with NA) in predicted vectors:

mod2 <- update(mod1,na.action=na.exclude)
predict(mod2)
##    1    2    3    4    5    6    7    8    9   10 
##  100  200  300  400   NA  600  700  800  900 1000 
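
As a sketch of how this applies to the original question: with na.exclude the prediction vector has one element per row, so it can be assigned straight back into the data frame. If the goal is an actual predicted value for the row whose Y is missing (not an NA placeholder), one option is to pass the data frame as newdata, since predict() then only needs the predictors. The column names y_ip and y_hat below are illustrative choices, not anything required by R.

```r
df <- data.frame(x = 1:10, y = 100 * (1:10))
df$y[5] <- NA

# Option 1: na.exclude pads the predictions with NA for the dropped row
mod2 <- lm(y ~ x + 1, data = df, na.action = na.exclude)
df$y_ip <- predict(mod2)    # length matches nrow(df); row 5 is NA

# Option 2: predict on the full data frame; only x is needed,
# so row 5 gets a fitted value even though its y is missing
mod1 <- lm(y ~ x + 1, data = df)
df$y_hat <- predict(mod1, newdata = df)

df
```

Option 2 works with either na.action setting, because the missing value is in the response and not in a predictor.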
Answered by Ben Bolker, Feb 03 '23 03:02