My question is about unnecessary predictors, namely variables that do not provide any new linear information because they are linear combinations of the other predictors. As you can see, the swiss dataset has six variables.
data(swiss)
names(swiss)
# "Fertility" "Agriculture" "Examination" "Education"
# "Catholic" "Infant.Mortality"
Now I introduce a new variable ec. It is a linear combination of Examination and Catholic.
ec <- swiss$Examination + swiss$Catholic
When we run a linear regression with unnecessary variables, R drops terms that are linear combinations of other terms and returns NA as their coefficients. The command below illustrates the point perfectly.
lm(Fertility ~ . + ec, swiss)
Coefficients:
(Intercept) Agriculture Examination Education
66.9152 -0.1721 -0.2580 -0.8709
Catholic Infant.Mortality ec
0.1041 1.0770 NA
However, when we regress first on ec and then on all of the regressors, as shown below, the NA moves:
lm(Fertility ~ ec + ., swiss)
Coefficients:
(Intercept) ec Agriculture Examination
66.9152 0.1041 -0.1721 -0.3621
Education Catholic Infant.Mortality
-0.8709 NA 1.0770
I would expect the coefficients of both Catholic and Examination to be NA. The variable ec is a linear combination of both of them, yet in the end the coefficient of Examination is not NA whereas that of Catholic is.
Could anyone explain the reason for this?
Will there be NA?
Yes. Adding these columns does not enlarge the column space; the resulting model matrix is rank-deficient.
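We can see this directly in R (a minimal sketch of my own, reusing the swiss data and ec from the question):
# appending ec does not enlarge the column space: the rank stays at 6
X0 <- model.matrix(Fertility ~ ., swiss)                  # intercept + 5 predictors
X1 <- cbind(X0, ec = swiss$Examination + swiss$Catholic)  # 7 columns now
qr(X0)$rank  # 6
qr(X1)$rank  # still 6, so X1 is rank-deficient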
How many NA?
It depends on the numerical rank.
number of NA = number of coefficients - rank of model matrix
In your example, after introducing ec, there will be one NA. Changing the specification order of covariates in the model formula essentially shuffles the columns of the model matrix. This does not change the matrix rank, so you will always get exactly one NA regardless of your specification order.
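You can verify this count directly (a small check I am adding here, assuming ec is defined as in the question):
fit <- lm(Fertility ~ . + ec, swiss)
# number of NA = number of coefficients - rank of model matrix
length(coef(fit)) - qr(model.matrix(fit))$rank  # 1
sum(is.na(coef(fit)))                           # 1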
OK, but which one is NA?
lm does LINPACK QR factorization with restricted column pivoting. The order of covariates affects which one is NA. Generally, a "first come, first served" principle holds, and the position of the NA is quite predictable. Take your examples for illustration. In the first specification, the collinear terms show up in the order Examination, Catholic, ec, so the third one, ec, gets the NA coefficient. In your second specification, these terms show up in the order ec, Examination, Catholic, and the third one, Catholic, gets the NA coefficient. Note that coefficient estimation is not invariant to specification order, although fitted values are invariant.
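You can check the invariance of fitted values yourself (a minimal sketch, again assuming ec from the question):
fit1 <- lm(Fertility ~ . + ec, swiss)
fit2 <- lm(Fertility ~ ec + ., swiss)
# the NA moves between the two fits, but the fitted values agree
all.equal(fitted(fit1), fitted(fit2))  # TRUE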
If LAPACK QR factorization with complete column pivoting were used instead, coefficient estimation would be invariant to specification order. However, the position of the NA is not as predictable as in the LINPACK case; it is decided purely numerically.
LAPACK-based QR factorization is implemented in the mgcv package. Numerical rank is detected when REML estimation is used, and unidentifiable coefficients are reported as 0 (not NA). So we can compare lm with gam / bam for linear model estimation. Let's first construct a toy dataset.
set.seed(0)
# an initial full-rank 500 x 10 matrix
X <- matrix(runif(500 * 10), 500)
# make the last column a random linear combination of the previous 9 columns
X[, 10] <- X[, -10] %*% runif(9)
# a random response
Y <- rnorm(500)
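As a quick sanity check (my addition), the constructed matrix is indeed rank-deficient:
qr(X)$rank  # 9, although X has 10 columns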
Now we shuffle the columns of X to see whether the NA changes its position under lm estimation, or whether the 0 changes its position under gam and bam estimation.
test <- function (fun = lm, seed = 0, ...) {
  shuffleFit <- function (fun) {
    # randomly permute the columns of X and fit the model
    shuffle <- sample.int(ncol(X))
    Xs <- X[, shuffle]
    b <- unname(coef(fun(Y ~ Xs, ...)))
    # map the coefficients back to the original column order
    back <- order(shuffle)
    c(b[1], b[-1][back])
  }
  set.seed(seed)
  # repeat the shuffle-and-fit 10 times; one row per replicate
  oo <- t(replicate(10, shuffleFit(fun)))
  colnames(oo) <- c("intercept", paste0("X", 1:ncol(X)))
  oo
}
First we check with lm:
test(fun = lm)
We see that the NA changes its position with column shuffling of X. The estimated coefficients vary, too.
Now we check with gam:
library(mgcv)
test(fun = gam, method = "REML")
We see that estimation is invariant to column shuffling of X, and the coefficient for X5 is always 0.
Finally we check bam (bam is slow for a small dataset like this one; it is designed for large or very large datasets, so the following is noticeably slower):
test(fun = bam, gc.level = -1)
The result is the same as what we see for gam.
ec, Examination and Catholic are three variables of which any two determine the third; the important part is that two out of the three are always required. When you pass this to lm, the first two of the three related variables get coefficients and the third one ends up with NA. The order of the variables is important. I hope this explains why both Examination and Catholic are not NA: from ec alone, you cannot determine both Examination and Catholic.
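To see this concretely, here is a small sketch of my own (reusing the swiss data, with ec added as a column): dropping any one of the three collinear variables removes the rank deficiency, so no coefficient is NA.
swiss2 <- transform(swiss, ec = Examination + Catholic)
# keep ec and Examination, drop Catholic: full rank, no NA
fit <- lm(Fertility ~ . - Catholic, swiss2)
anyNA(coef(fit))  # FALSE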