I am running a linear regression on a number of attributes including two categorical attributes, B and F, and I don't get a coefficient value for every factor level I have. B has 9 levels and F has 6 levels. When I initially ran the model (with an intercept), I got 8 coefficients for B and 5 for F, which I understood as the first level of each being included in the intercept.
I want to rank the levels within B and F based on their coefficients, so I added -1 after each factor to lock the intercept at 0 so that I could get coefficients for all levels.
Call:
lm(formula = dependent ~ a + B-1 + c + d + e + F-1 + g + h, data = input)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
a 2.082e+03 1.026e+02 20.302 < 2e-16 ***
B1 -1.660e+04 9.747e+02 -17.027 < 2e-16 ***
B2 -1.681e+04 9.379e+02 -17.920 < 2e-16 ***
B3 -1.653e+04 9.254e+02 -17.858 < 2e-16 ***
B4 -1.765e+04 9.697e+02 -18.202 < 2e-16 ***
B5 -1.535e+04 1.388e+03 -11.059 < 2e-16 ***
B6 -1.677e+04 9.891e+02 -16.954 < 2e-16 ***
B7 -1.644e+04 9.694e+02 -16.961 < 2e-16 ***
B8 -1.931e+04 9.899e+02 -19.512 < 2e-16 ***
B9 -1.722e+04 9.071e+02 -18.980 < 2e-16 ***
c -6.928e-01 6.977e-01 -0.993 0.321272
d -3.288e-01 2.613e+00 -0.126 0.899933
e -8.384e-01 1.171e+00 -0.716 0.474396
F2 4.679e+02 2.176e+02 2.150 0.032146 *
F3 7.753e+02 2.035e+02 3.810 0.000159 ***
F4 1.885e+02 1.689e+02 1.116 0.265046
F5 5.194e+02 2.264e+02 2.295 0.022246 *
F6 1.365e+03 2.334e+02 5.848 9.94e-09 ***
g 4.278e+00 7.350e+00 0.582 0.560847
h 2.717e-02 5.100e-03 5.328 1.62e-07 ***
This worked in part, leading to the display of all levels of B; however, F1 is still not displayed. As there is no longer an intercept, I am confused why F1 is not in the linear model.
Switching the order of the call so that + F - 1 precedes + B - 1 results in coefficients of all levels of F being visible but not B1.
Does anybody know either how to display all levels of both B and F, or how to assess the relative weight of F1 compared to other levels of F from the outputs I have?
This issue is raised over and over again, but unfortunately no satisfying answer has been made which can be an appropriate duplicate target. Looks like I need to write one.
Most people know this is related to "contrasts", but not everyone knows why contrasts are needed or how to interpret the result. We have to look at the model matrix in order to fully digest this.
Suppose we are interested in a model with two factors: ~ f + g (numerical covariates do not matter, so I include none of them; the response does not appear in the model matrix, so I drop it, too). Consider the following reproducible example:
set.seed(0)
f <- sample(gl(3, 4, labels = letters[1:3]))
# [1] c a a b b a c b c b a c
#Levels: a b c
g <- sample(gl(3, 4, labels = LETTERS[1:3]))
# [1] A B A B C B C A C C A B
#Levels: A B C
We start with a model matrix with no contrasts at all:
X0 <- model.matrix(~ f + g, contrasts.arg = list(
f = contr.treatment(n = 3, contrasts = FALSE),
g = contr.treatment(n = 3, contrasts = FALSE)))
# (Intercept) f1 f2 f3 g1 g2 g3
#1 1 0 0 1 1 0 0
#2 1 1 0 0 0 1 0
#3 1 1 0 0 1 0 0
#4 1 0 1 0 0 1 0
#5 1 0 1 0 0 0 1
#6 1 1 0 0 0 1 0
#7 1 0 0 1 0 0 1
#8 1 0 1 0 1 0 0
#9 1 0 0 1 0 0 1
#10 1 0 1 0 0 0 1
#11 1 1 0 0 1 0 0
#12 1 0 0 1 0 1 0
Note, we have:
unname( rowSums(X0[, c("f1", "f2", "f3")]) )
# [1] 1 1 1 1 1 1 1 1 1 1 1 1
unname( rowSums(X0[, c("g1", "g2", "g3")]) )
# [1] 1 1 1 1 1 1 1 1 1 1 1 1
So f1 + f2 + f3 = g1 + g2 + g3 = (Intercept), that is, the intercept column lies in both span{f1, f2, f3} and span{g1, g2, g3}. In this full specification, 2 columns are not identifiable. X0 will have column rank 1 + 3 + 3 - 2 = 5:
qr(X0)$rank
# [1] 5
So, if we fit a linear model with this X0, 2 coefficients out of 7 parameters will be NA:
y <- rnorm(12) ## random `y` as a response
lm(y ~ X0 - 1) ## drop intercept as `X0` has intercept already
#X0(Intercept) X0f1 X0f2 X0f3 X0g1
# 0.32118 0.05039 -0.22184 NA -0.92868
# X0g2 X0g3
# -0.48809 NA
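To see what those NA values mean, here is a minimal sketch on small hypothetical vectors f, g and y (chosen for illustration, not the sampled data above). It fits the over-specified matrix and the default contrasted model side by side. The NA coefficients are aliased with the other columns rather than missing information, so both fits should give the same fitted values.

```r
# Hypothetical toy data, chosen for illustration (not the sampled data above).
f <- factor(rep(c("a", "b", "c"), times = 4))
g <- factor(c("A", "A", "B", "B", "C", "C", "B", "C", "A", "C", "A", "B"))
set.seed(1)
y <- rnorm(12)

# Full dummy coding for both factors plus an intercept: 7 columns.
X0 <- model.matrix(~ f + g, contrasts.arg = list(
  f = contr.treatment(n = 3, contrasts = FALSE),
  g = contr.treatment(n = 3, contrasts = FALSE)))

fit_full    <- lm(y ~ X0 - 1)  # over-specified: some coefficients come back NA
fit_default <- lm(y ~ f + g)   # default treatment contrasts

# Number of coefficients lm() could not identify.
n_na <- sum(is.na(coef(fit_full)))

# Both parametrizations span the same column space, so the fits agree.
same_fit <- isTRUE(all.equal(unname(fitted(fit_full)),
                             unname(fitted(fit_default))))
n_na
same_fit
```

Only the parametrization differs between the two fits; the predictions do not.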
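What this really implies is that we have to add 2 linear constraints on the 7 parameters in order to get a full-rank model. It does not really matter what these 2 constraints are, but there must be 2 linearly independent constraints. For example, we can do any of the following:
- drop 2 columns from X0;
- add two sum-to-zero constraints on the parameters, e.g. require that the coefficients for f1, f2 and f3 sum to 0, and the same for g1, g2 and g3;
- use regularization, e.g. add a penalty on the coefficients of f and g.
Note, these three ways end up with three different solutions:
- contrasts;
- constrained least squares;
- penalized regression or linear mixed models.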
The first two are still in the scope of fixed effect modelling. By "contrasts", we reduce the number of parameters until we get a full-rank model matrix, while the other two do not reduce the number of parameters but reduce the effective degrees of freedom.
Now, you are certainly after the "contrasts" way. So, remember, we have to drop 2 columns. They can be
- one column from f and one column from g, giving a model ~ f + g, with f and g contrasted;
- the intercept, and one column from either f or g, giving a model ~ f + g - 1.
Now you should be clear that, within the framework of dropping columns, there is no way you can get what you want, because you are expecting to drop only 1 column. The resulting model matrix will still be rank-deficient.
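The second option can be checked directly on the model matrix. The sketch below uses hypothetical toy factors f and g (not the sampled data above). It shows that writing - 1 twice is the same as writing it once, and that whichever factor comes first keeps all of its levels while the other is still contrasted.

```r
# Hypothetical toy factors, chosen for illustration.
f <- factor(rep(c("a", "b", "c"), times = 4))
g <- factor(c("A", "A", "B", "B", "C", "C", "B", "C", "A", "C", "A", "B"))

# "- 1" only removes the intercept; repeating it changes nothing.
X1 <- model.matrix(~ f - 1 + g - 1)
X2 <- model.matrix(~ f + g - 1)
identical(colnames(X1), colnames(X2))

# The first factor gets a column for every level, the second loses its first level.
colnames(X2)   # "fa" "fb" "fc" "gB" "gC"
qr(X2)$rank    # 5: full column rank with 5 columns

# Swapping the order swaps which factor is fully expanded.
X3 <- model.matrix(~ g + f - 1)
colnames(X3)   # "gA" "gB" "gC" "fb" "fc"
```

This is the pattern in the question: F1 disappears when B comes first, and B1 disappears when F comes first. The coefficients of the contrasted factor are differences from its first level, whose implicit coefficient is 0, so that level can still be ranked against the others.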
If you really want to have all coefficients there, use constrained least squares, or penalized regression / linear mixed models.
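As one possible sketch of the sum-to-zero idea, the code below uses R's contr.sum contrasts instead of a dedicated constrained least squares solver, and then reconstructs the effect of the last level by hand. The data f, g and y are hypothetical and chosen for illustration. This is only one way to obtain an effect for every level, not the only one.

```r
# Hypothetical toy data, chosen for illustration.
f <- factor(rep(c("a", "b", "c"), times = 4))
g <- factor(c("A", "A", "B", "B", "C", "C", "B", "C", "A", "C", "A", "B"))
set.seed(1)
y <- rnorm(12)

# Sum-to-zero contrasts: the level effects of each factor sum to 0.
fit_sum <- lm(y ~ f + g, contrasts = list(f = "contr.sum", g = "contr.sum"))
b <- coef(fit_sum)

# contr.sum estimates the first k - 1 level effects;
# the last level's effect is minus the sum of the others.
f_eff <- c(b[["f1"]], b[["f2"]], -(b[["f1"]] + b[["f2"]]))
names(f_eff) <- levels(f)
g_eff <- c(b[["g1"]], b[["g2"]], -(b[["g1"]] + b[["g2"]]))
names(g_eff) <- levels(g)

# Rank the levels within each factor by their effect.
sort(f_eff, decreasing = TRUE)
sort(g_eff, decreasing = TRUE)
```

Differences between two level effects are the same as under the default treatment contrasts, so the ranking of levels within a factor does not depend on which of these parametrizations is used.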
Now, when we have interaction of factors, things are more complicated but the idea is still the same. But given that my answer is already long enough, I don't want to continue.