My question is about unnecessary predictors, namely variables that do not provide any new linear information because they are linear combinations of the other predictors. As you can see, the swiss dataset has six variables.
data(swiss)
names(swiss)
# "Fertility" "Agriculture" "Examination" "Education"
# "Catholic" "Infant.Mortality"
Now I introduce a new variable ec. It is a linear combination of Examination and Catholic.
ec <- swiss$Examination + swiss$Catholic
When we run a linear regression with unnecessary variables, R drops terms that are linear combinations of other terms and returns NA as their coefficients. The command below illustrates the point perfectly.
lm(Fertility ~ . + ec, swiss)
Coefficients:
(Intercept) Agriculture Examination Education
66.9152 -0.1721 -0.2580 -0.8709
Catholic Infant.Mortality ec
0.1041 1.0770 NA
However, when we regress first on ec and then on all of the regressors, as shown below, the NA moves:
lm(Fertility ~ ec + ., swiss)
Coefficients:
(Intercept) ec Agriculture Examination
66.9152 0.1041 -0.1721 -0.3621
Education Catholic Infant.Mortality
-0.8709 NA 1.0770
I would expect the coefficients of both Catholic and Examination to be NA. The variable ec is a linear combination of both of them, yet in the end the coefficient of Examination is not NA whereas that of Catholic is.
Could anyone explain the reason for this?
Will there be NA?
Yes. Adding these columns does not enlarge the column space; the resulting model matrix is rank-deficient.
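We can see this directly in R (a minimal sketch of my own, reusing the swiss data and ec from the question):
# appending ec does not enlarge the column space: the rank stays at 6
X0 <- model.matrix(Fertility ~ ., swiss)                  # intercept + 5 predictors
X1 <- cbind(X0, ec = swiss$Examination + swiss$Catholic)  # 7 columns now
qr(X0)$rank  # 6
qr(X1)$rank  # still 6, so X1 is rank-deficient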
How many NA?
It depends on the numerical rank.
number of NA = number of coefficients - rank of model matrix
In your example, after introducing ec, there will be one NA. Changing the specification order of covariates in the model formula essentially shuffles the columns of the model matrix. This does not change the matrix rank, so you will always get exactly one NA regardless of your specification order.
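You can verify this count directly (a small check I am adding here, assuming ec is defined as in the question):
fit <- lm(Fertility ~ . + ec, swiss)
# number of NA = number of coefficients - rank of model matrix
length(coef(fit)) - qr(model.matrix(fit))$rank  # 1
sum(is.na(coef(fit)))                           # 1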
OK, but which one is NA?
lm does LINPACK QR factorization with restricted column pivoting. The order of covariates affects which one is NA. Generally, a "first come, first served" principle holds, and the position of the NA is quite predictable. Take your examples for illustration. In the first specification, the collinear terms show up in the order Examination, Catholic, ec, so the third one, ec, gets the NA coefficient. In your second specification, these terms show up in the order ec, Examination, Catholic, and the third one, Catholic, gets the NA coefficient. Note that coefficient estimation is not invariant to specification order, although fitted values are invariant.
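You can check the invariance of fitted values yourself (a minimal sketch, again assuming ec from the question):
fit1 <- lm(Fertility ~ . + ec, swiss)
fit2 <- lm(Fertility ~ ec + ., swiss)
# the NA moves between the two fits, but the fitted values agree
all.equal(fitted(fit1), fitted(fit2))  # TRUE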
If LAPACK QR factorization with complete column pivoting were used instead, coefficient estimation would be invariant to specification order. However, the position of the NA is not as predictable as in the LINPACK case; it is decided purely numerically.
LAPACK-based QR factorization is implemented in the mgcv package. Numerical rank is detected when REML estimation is used, and unidentifiable coefficients are reported as 0 (not NA). So we can compare lm with gam / bam for linear model estimation. Let's first construct a toy dataset.
set.seed(0)
# an initial full-rank 500 x 10 matrix
X <- matrix(runif(500 * 10), 500)
# make the last column a random linear combination of the previous 9 columns
X[, 10] <- X[, -10] %*% runif(9)
# a random response
Y <- rnorm(500)
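As a quick sanity check (my addition), the constructed matrix is indeed rank-deficient:
qr(X)$rank  # 9, although X has 10 columns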
Now we shuffle the columns of X to see whether the NA changes its position under lm estimation, or whether the 0 changes its position under gam and bam estimation.
test <- function (fun = lm, seed = 0, ...) {
  shuffleFit <- function (fun) {
    # randomly permute the columns of X and fit the model
    shuffle <- sample.int(ncol(X))
    Xs <- X[, shuffle]
    b <- unname(coef(fun(Y ~ Xs, ...)))
    # map the coefficients back to the original column order
    back <- order(shuffle)
    c(b[1], b[-1][back])
  }
  set.seed(seed)
  # repeat the shuffle-and-fit 10 times; one row per replicate
  oo <- t(replicate(10, shuffleFit(fun)))
  colnames(oo) <- c("intercept", paste0("X", 1:ncol(X)))
  oo
}
First we check with lm:
test(fun = lm)
We see that the NA changes its position with column shuffling of X. The estimated coefficients vary, too.
Now we check with gam:
library(mgcv)
test(fun = gam, method = "REML")
We see that estimation is invariant to column shuffling of X, and the coefficient for X5 is always 0.
Finally we check bam (bam is slow for a small dataset like this one; it is designed for large or very large datasets, so the following is noticeably slower):
test(fun = bam, gc.level = -1)
The result is the same as what we see for gam.
ec, Examination and Catholic are three variables of which any two determine the third; the important part is that two out of the three are always required. When you pass this to lm, the first two of the three related variables get coefficients and the third one ends up with NA. The order of the variables is important. I hope this explains why both Examination and Catholic are not NA: from ec alone, you cannot determine both Examination and Catholic.
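To see this concretely, here is a small sketch of my own (reusing the swiss data, with ec added as a column): dropping any one of the three collinear variables removes the rank deficiency, so no coefficient is NA.
swiss2 <- transform(swiss, ec = Examination + Catholic)
# keep ec and Examination, drop Catholic: full rank, no NA
fit <- lm(Fertility ~ . - Catholic, swiss2)
anyNA(coef(fit))  # FALSE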