So I am having some issues with some NA
values in the residuals of a lm
cross sectional regression in R.
The issue isn't the NA
values themselves, it's the way R presents them.
For example:
test$residuals
# 1 2 4 5
# 0.2757677 -0.5772193 -5.3061303 4.5102816
test$residuals[3]
# 4
# -5.30613
In this simple example a NA
value will make one of the residuals go missing. When I extract the residuals I can clearly see the third index missing. So far so good, no complaints here. The problem is that the corresponding numeric vector is now one item shorter so the third index is actually the fourth. How can I make R return these residuals instead, i.e., explicitly showing NA
instead of skipping an index?
test$residuals
# 1 2 3 4 5
# 0.2757677 -0.5772193 NA -5.3061303 4.5102816
I need to keep track of all individual residuals so it would make my life much easier if I could extract them this way instead.
The na. omit() function returns a list without any rows that contain na values. It will drop rows with na value / nan values. This is the fastest way to remove na rows in the R programming language.
First, if we want to exclude missing values from mathematical operations use the na. rm = TRUE argument. If you do not exclude these values most functions will return an NA . We may also desire to subset our data to obtain complete observations, those observations (rows) in our data that contain no missing data.
Yet another idea is to take advantage of the row names associated with the data frame provided as input to lm
. In that case, the residuals should retain the names from the source data. Accessing the residuals from your example would give a value of -5.3061303 for test$residuals["4"]
and NA for test$residuals["3"]
.
However, this does not exactly answer your question. One approach to doing exactly what you asked for in terms of getting the NA values back into the residuals is illustrated below:
> D<-data.frame(x=c(NA,2,3,4,5,6),y=c(2.1,3.2,4.9,5,6,7),residual=NA)
> Z<-lm(y~x,data=D)
> D[names(Z$residuals),"residual"]<-Z$residuals
> D
x y residual
1 NA 2.1 NA
2 2 3.2 -0.28
3 3 4.9 0.55
4 4 5.0 -0.22
5 5 6.0 -0.09
6 6 7.0 0.04
If you are doing predictions based on the regression results, you may want to specify na.action=na.exclude
in lm
. See the help results for na.omit
for a discussion. Note that simply specifying na.exclude
does not actually put the NA values back into the residuals vector itself.
As noted in a prior answer, resid
(synonym for residuals
) provides a generic access function in which the residuals will contain the desired NA values if na.exclude
was specified in lm
. Using resid
is probably more general and a cleaner approach. In that case, the code for the above example would be changed to:
> D<-data.frame(x=c(NA,2,3,4,5,6),y=c(2.1,3.2,4.9,5,6,7),residual=NA)
> Z<-lm(y~x,data=D,na.action=na.exclude)
> D$residuals<-residuals(Z)
I just found this googling around a bit deeper. The resid
function on a lm
with na.action=na.exclude
is the way to go.
Here an illustrated strategy using a slightly modified example on the lm help page. This is a direct application of the definition of residual:
## Annette Dobson (1990) "An Introduction to Generalized Linear Models".
## Page 9: Plant Weight Data.
# Two NA's introduced
weight <- c(4.17,5.58,NA,6.11,4.50,4.61,5.17,4.53,5.33,5.14,
4.81,4.17,4.41,3.59,5.87,3.83,6.03,NA,4.32,4.69)
group <- gl(2,10,20, labels=c("Ctl","Trt"))
lm.D9 <- lm(weight ~ group)
rr2 <- weight- predict(lm.D9, na.action=na.pass)
Warning message:
In weight - predict(lm.D9, na.action = na.pass) :
longer object length is not a multiple of shorter object length
> rr2
[1] -0.8455556 0.5644444 NA 1.0944444 -0.5155556 -0.4055556 0.1544444
[8] -0.4855556 0.3144444 0.5044444 0.1744444 -0.4655556 -0.2255556 -1.0455556
[15] 1.2344444 -0.8055556 1.3944444 NA -0.6955556 -0.3255556
I think it would be dangerous to directly modify an lm object so that lm.D9$residual would return that result.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With