My question: what is the most efficient way of removing a predictor with NAs and consider the complete cases excluding that predictor?
The question arises from the following regression situation with NAs, in which there are missing values in Ozone (mostly) and Solar.R.
data(airquality)
summary(airquality)
# Ozone Solar.R Wind Temp Month
# Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00 Min. :5.000
# 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00 1st Qu.:6.000
# Median : 31.50 Median :205.0 Median : 9.700 Median :79.00 Median :7.000
# Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88 Mean :6.993
# 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00 3rd Qu.:8.000
# Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00 Max. :9.000
# NA's :37 NA's :7
# Day
# Min. : 1.0
# 1st Qu.: 8.0
# Median :16.0
# Mean :15.8
# 3rd Qu.:23.0
# Max. :31.0
Regression of Wind on the remaining variables. Considers only the complete cases.
summary(lm(Wind ~ ., data = airquality))
#
# Call:
# lm(formula = Wind ~ ., data = airquality)
#
# Residuals:
# Min 1Q Median 3Q Max
# -4.3908 -2.2800 -0.3078 1.4132 9.6501
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 15.519460 2.918393 5.318 5.96e-07 ***
# Ozone -0.060746 0.011798 -5.149 1.23e-06 ***
# Solar.R 0.003791 0.003216 1.179 0.241
# Temp -0.036604 0.044576 -0.821 0.413
# Month -0.159671 0.208082 -0.767 0.445
# Day 0.017353 0.031238 0.556 0.580
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 2.822 on 105 degrees of freedom
# (42 observations deleted due to missingness)
# Multiple R-squared: 0.3994, Adjusted R-squared: 0.3708
# F-statistic: 13.96 on 5 and 105 DF, p-value: 1.857e-10
If Ozone is removed, still considers only the complete cases (with Ozone included). But this is different from manually removing Ozone.
summary(lm(Wind ~ . - Ozone, data = airquality))
#
# Call:
# lm(formula = Wind ~ . - Ozone, data = airquality)
#
# Residuals:
# Min 1Q Median 3Q Max
# -6.012 -2.323 -0.361 1.493 9.605
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 24.3159074 2.6354288 9.227 3.09e-15 ***
# Solar.R 0.0009228 0.0035281 0.262 0.794
# Temp -0.1900820 0.0369159 -5.149 1.21e-06 ***
# Month 0.0313046 0.2280600 0.137 0.891
# Day 0.0008969 0.0346116 0.026 0.979
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 3.143 on 106 degrees of freedom
# (42 observations deleted due to missingness)
# Multiple R-squared: 0.2477, Adjusted R-squared: 0.2193
# F-statistic: 8.727 on 4 and 106 DF, p-value: 3.961e-06
summary(lm(Wind ~ Solar.R + Temp + Wind + Month + Day, data = airquality))
#
# Call:
# lm(formula = Wind ~ Solar.R + Temp + Wind + Month + Day, data = airquality)
#
# Residuals:
# Min 1Q Median 3Q Max
# -8.1779 -2.2063 -0.2757 1.9448 9.3510
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 23.660271 2.416766 9.790 < 2e-16 ***
# Solar.R 0.002980 0.003113 0.957 0.340
# Temp -0.186386 0.032725 -5.695 6.89e-08 ***
# Month 0.074952 0.206334 0.363 0.717
# Day -0.011028 0.030304 -0.364 0.716
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 3.158 on 141 degrees of freedom
# (7 observations deleted due to missingness)
# Multiple R-squared: 0.2125, Adjusted R-squared: 0.1901
# F-statistic: 9.511 on 4 and 141 DF, p-value: 7.761e-07
It is indeed unfortunate and surprising that Wind ~ . - Ozone considers Ozone when finding complete cases; seems worth discussion on the [email protected] mailing list, if you want to pursue it. In the meantime, how about
summary(lm(Wind ~ ., data = subset(airquality, select=-Ozone))
?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With