I am trying to understand how R determines reference groups for interactions in a linear model. Consider the following:
df <- structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L,
5L, 5L, 5L, 5L, 5L, 5L), .Label = c("1", "2", "3", "4", "5"), class = "factor"),
year = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L), .Label = c("1", "2"), class = "factor"),
treatment = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("0", "1"), class = "factor"),
y = c(1.4068142116718, 2.67041187927052, 2.69166439169131,
3.56550324537293, 1.60021286173782, 4.26629963353237, 3.85741108250572,
5.15740731957689, 4.15629768365669, 6.14875441068499, 3.31277276551286,
3.47223277168367, 3.74152201649338, 4.02734382610191, 4.49388620764795,
5.6432833241724, 4.76639399631094, 4.16885857079297, 4.96830394378801,
5.6286092105837, 6.60521404151111, 5.51371821706176, 3.97244221149279,
5.68793413111161, 4.90457233598412, 6.02826151378941, 4.92468415350312,
8.23718422822134, 5.87695836962708, 7.47264895892575)), .Names = c("id",
"year", "treatment", "y"), row.names = c(NA, -30L), class = "data.frame")
lm(y ~ -1 + id + year + year:treatment, df)
#Coefficients:
# id1 id2 id3 id4
# 2.6585 3.9933 4.1161 5.3544
# id5 year2 year1:treatment1 year2:treatment1
# 6.1991 0.7149 -0.6317 NA
R tries to estimate the full set of interactions instead of consistently omitting a reference group. As a result, I am getting NA
's in the results.
Also, R is inconsistent with which groups it drops. I would like to estimate a model with the same omitted group (year1
) in the main effect and interactions. How to force R to omit year1
and year1:treatment1
from the preceding model?
I understand that there are several workarounds for this problem (e.g. creating all the variables by hand and writing them out in the model's formula). But the actual models I am estimating are much more complicated versions of this problem and such a workaround would be cumbersome.
NA as a coefficient in a regression indicates that the variable in question is linearly related to the other variables.
So the coefficient of the interaction β3 = 1.4 can be interpreted as follows: The effect on muscle mass of a 100 g increase in protein intake is 1.4 Kg more for those who exercise versus those who don't.
It's the effect only when the other term in the interaction equals 0. So if an interaction isn't significant, should you drop it? If you are just checking for the presence of an interaction to make sure you are specifying the model correctly, go ahead and drop it.
If one of these answers works for you perhaps you might accept it or request a clarification. If the interaction is not significant, then you should drop it and run a regression without it.
R tries to estimate the full set of interactions instead of consistently omitting a reference group. As a result, I am getting
NA
s in the results.
There is no causality between those two. You get NA
purely because your variables have nesting.
R is inconsistent with which groups it drops. I would like to estimate a model with the same omitted group (
year1
) in the main effect and interactions. How to force R to omityear1
andyear1:treatment1
from the preceding model?
There is no inconsistency but model.matrix
has its own rule. You get seemingly "inconsistent" contrasts because you don't have main effect treatment
but only interaction treatment:year
.
In the following, I will first explain NA
coefficients, then the contrasts for interaction, and finally give some suggestions.
NA
coefficientsBy default, contrast treatment is used for contrasting factor, and by default of contr.treatement
, the first factor level is taken as reference level. Have a look at your data:
levels(df$id)
# [1] "1" "2" "3" "4" "5"
levels(df$year)
# [1] "1" "2"
levels(df$treatment)
# [1] "0" "1"
Now take a look at a simple linear model:
lm(y ~ id + year + treatment, df)
#Coefficients:
#(Intercept) id2 id3 id4 id5 year2
# 2.153 1.651 1.773 2.696 3.541 1.094
# treatment1
# NA
You can see that id1
, year1
and treatment0
are not there, as they are taken as reference.
Even without interaction, you already have an NA
coefficient. This implies that treatment
is nested in span{id, year}
. When you further include treatment:year
, such nesting still exists.
In fact, a further test shows that treatment
is nested in id
:
lm(y ~ id + treatment, df)
# Coefficients:
#(Intercept) id2 id3 id4 id5 treatment1
# 2.700 1.651 1.773 2.696 3.541 NA
In summary, variable treatment
is completely redundant for your modelling purpose. If you include id
, then there is no need to include treatment
or treatment:*
where *
can be any variable.
It is very easy to get nesting when you have many factor variables in a regression model. Note, contrasts will not necessarily remove nesting, as contrast only recognises variable name, but not potential numerical feature. Take the following example for how to cheat contr.treatment
:
df$ID <- df$id
lm(y ~ id + ID, df)
#Coefficients:
#(Intercept) id2 id3 id4 id5 ID2
# 2.700 1.651 1.773 2.696 3.541 NA
# ID3 ID4 ID5
# NA NA NA
Look, contrasts works as expected, but ID
is nested in id
so we have rank-deficiency.
We first remove the noise imposed by NA
, by dropping id
variable. Then, a regression model with treatment
and year
will be full-rank, so no NA
should be seen if contrasts is successful.
Contrasts for interaction, or high-order effects, depends on the presence of low-order effects. Compare the following models:
lm(y ~ year:treatment, df) ## no low-order effects at all
#Coefficients:
# (Intercept) treatment0:year1 treatment1:year1 treatment0:year2
# 5.4523 -1.3976 -1.3466 -0.6826
#treatment1:year2
# NA
lm(y ~ year + treatment + year:treatment, df) ## with all low-order effects
#Coefficients:
# (Intercept) treatment1 year2 treatment1:year2
# 4.05471 0.05094 0.71493 0.63170
In the first model, no contrasts is done, because there is no presence of main effects, but only the 2nd order effect. The NA
here is due to the absence of contrasts.
Now consider another set of examples, by including some but not all main effects:
lm(y ~ year + year:treatment, df) ## main effect `treatment` is missing
#Coefficients:
# (Intercept) year2 year1:treatment1 year2:treatment1
# 4.05471 0.71493 0.05094 0.68264
lm(y ~ treatment + year:treatment, df) ## main effect `year` is missing
#Coefficients:
# (Intercept) treatment1 treatment0:year2 treatment1:year2
# 4.05471 0.05094 0.71493 1.34663
Here, contrasts for interaction will happen to the variable whose main effect is missing. For example, in the first model, main effect treatment
is missing so interaction drops treatement0
; while in the second model, main effect year
is missing so interaction drops year1
.
Always include all low-order effects when specifying high-order effects. This not only gives easy-to-understand contrasts behaviour, but also has some other appealing statistical reason. You can read Including the interaction but not the main effects in a model on Cross Validated.
Another suggestion, is to always include intercept. In linear regression, a model with intercept yields unbiased estimate, and residuals will have 0 mean.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With