When I estimate a model that has an interaction between two variables that don't enter the model as standalone variables, and when one of these variables is a dummy (class "logical") variable, R "flips the sign" of the dummy variable. That is, it reports an estimate of the coefficient on the interaction term when the dummy is FALSE, not when it is TRUE. Here is an example:
data(trees)
trees$dHeight <- trees$Height > 76
trees$cGirth <- trees$Girth - mean(trees$Girth)
lm(Volume ~ Girth + Girth:dHeight, data = trees) # estimate is for Girth:dHeightTRUE
lm(Volume ~ Girth + cGirth:dHeight, data = trees) # estimate is for cGirth:dHeightFALSE
Why does the regression in the last line produce an estimate for an interaction in which dHeight is FALSE rather than TRUE? (I would like R to report the estimate when dHeight is TRUE.)
This is not a big problem, but I would like to better understand why R is doing what it's doing. I know about relevel() and contrasts(), but I can't see that they would make a difference here.
The dHeight variable is logical. When the model is fitted, it is coerced to a factor, and the levels are sorted lexicographically (i.e. FALSE comes before TRUE).
As noted in @Hongooi's answer, you can't estimate all 4 parameters, so R fits the terms in the order they appear (FALSE before TRUE) and leaves the last one out.
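A minimal sketch of how one might check this directly, using the trees data and the variables from the question. It inspects the design matrix that the formula generates, confirms that its two interaction columns add up to cGirth (which is itself Girth minus a constant), and looks at which coefficient lm() leaves as NA. The object names X, dep and fit are only illustrative.

```r
data(trees)
trees$dHeight <- trees$Height > 76
trees$cGirth  <- trees$Girth - mean(trees$Girth)

# Design matrix implied by the formula: four columns, FALSE level first
X <- model.matrix(~ Girth + cGirth:dHeight, data = trees)
colnames(X)

# The matrix has four columns but only rank three
qr(X)$rank

# The two interaction columns sum to cGirth, i.e. Girth minus a constant,
# so they are linearly dependent on the intercept and Girth columns
dep <- unname(X[, "cGirth:dHeightFALSE"] + X[, "cGirth:dHeightTRUE"])
all.equal(dep, trees$cGirth)

# lm() keeps the columns in order and reports NA for the one it cannot estimate
fit <- lm(Volume ~ Girth + cGirth:dHeight, data = trees)
names(coef(fit))[is.na(coef(fit))]
```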
If you want to force R to fit the TRUE value first, you could fit the model to !dHeight:
lm(formula = Volume ~ Girth + cGirth:!dHeight, data = trees)
Note that !dHeightFALSE is equivalent to dHeightTRUE.
You will also note that in this simple case you are simply changing the sign on the coefficient so it doesn't really matter which model you fit.
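The sign-flip claim can be checked with a short sketch. To avoid depending on how ! is handled inside a formula, it uses a negated helper column, notTall, which is an assumption introduced here and not part of the original answer. The two fits should describe the same model, with interaction coefficients that differ only in sign.

```r
data(trees)
trees$dHeight <- trees$Height > 76
trees$cGirth  <- trees$Girth - mean(trees$Girth)
trees$notTall <- !trees$dHeight  # hypothetical helper: negated dummy

# Interaction coefficient reported for dHeight == FALSE
fit_false <- lm(Volume ~ Girth + cGirth:dHeight, data = trees)
# Interaction coefficient reported for notTall == FALSE, i.e. dHeight == TRUE
fit_true  <- lm(Volume ~ Girth + cGirth:notTall, data = trees)

b_false <- coef(fit_false)[["cGirth:dHeightFALSE"]]
b_true  <- coef(fit_true)[["cGirth:notTallFALSE"]]

# Same magnitude, opposite sign
all.equal(b_false, -b_true)

# Identical fitted values: the two formulas span the same column space
all.equal(fitted(fit_false), fitted(fit_true))
```

The intercept and the Girth coefficient do differ between the two fits, since they absorb the change of reference level.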
EDIT A FAR BETTER APPROACH
R can recognize that cGirth and Girth are collinear, so we can fit the model remembering that a/b expands to a + a:b:
lm(formula = Volume ~ Girth + cGirth/dHeight, data = trees)
Coefficients:
(Intercept) Girth cGirth cGirth:dHeightTRUE
-27.198 4.251 NA 1.286
This provides coefficients with easy-to-interpret names, and R will sensibly fail to return a coefficient for cGirth.
R can tell that Girth and cGirth are collinear when they are both in the model as "main effect" or standalone terms.
There is no way that R should be able to tell, when fitting Girth + cGirth:dHeight, that cGirth and Girth are collinear and that, given that dHeight is logical, you want cGirth:dHeightTRUE to be the coefficient it fits. (You could write your own formula parser to do this if you really wanted.)
Another approach that would fit the model you wanted, without any collinear terms, would be to use
lm(formula = Volume ~ Girth + I(cGirth*dHeight), data = trees)
which coerces dHeight to numeric (TRUE becomes 1).
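As a rough check, the I() version and the cGirth/dHeight version above should give the same estimate for the TRUE interaction, since both reduce to an intercept, Girth, and cGirth multiplied by the TRUE indicator. A sketch, with fit_I and fit_nested as illustrative names:

```r
data(trees)
trees$dHeight <- trees$Height > 76
trees$cGirth  <- trees$Girth - mean(trees$Girth)

# Product computed explicitly: dHeight is coerced to 0/1 inside I()
fit_I      <- lm(Volume ~ Girth + I(cGirth * dHeight), data = trees)
# Nested formula: expands to Girth + cGirth + cGirth:dHeight
fit_nested <- lm(Volume ~ Girth + cGirth/dHeight, data = trees)

# Third coefficient of the I() fit is the product term
b_I      <- unname(coef(fit_I)[3])
b_nested <- coef(fit_nested)[["cGirth:dHeightTRUE"]]

all.equal(b_I, b_nested)
```

The I() fit has no NA coefficient because no redundant column is ever created, at the cost of a less readable coefficient name.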
Edit to labor the point.
When you fit ~Girth + Girth:dHeight, what you are saying is that there is a main effect for Girth plus adjustments for dHeight. R considers the first level of a factor the reference level. The slope for dHeightFALSE is simply the value for Girth; you then have the adjustment for when dHeight == TRUE (Girth:dHeightTRUE).
When you fit ~Girth + cGirth:dHeight, R does not have a mind-reading parser that can tell, from the formula alone, that cGirth and Girth are collinear and that you therefore want FALSE treated as the reference level of dHeight in the interaction. As far as R is concerned, cGirth is a new variable, so it builds a column for each level of dHeight.
Imagine you had a variable that was totally unrelated to Girth, e.g.
set.seed(1)
trees$cG <- runif(nrow(trees))
Then when you fit Girth + cG:dHeight, you will get 4 parameters estimated:
lm(formula = Volume ~ Girth + cG:dHeight, data = trees)
Call:
lm(formula = Volume ~ Girth + cG:dHeight, data = trees)
Coefficients:
(Intercept) Girth cG:dHeightFALSE cG:dHeightTRUE
-31.79645 4.79435 -5.92168 0.09578
Which is sensible.
When R processes Girth + cGirth:dHeight, it will expand it out (with the first level of the factor first) to 1 + Girth + cGirth:dHeightFALSE + cGirth:dHeightTRUE, work out that it can't estimate all 4 parameters, and estimate the first 3.
R isn't flipping the sign on the dummy variable as such. When you fit ~ Girth + cGirth:dHeight, the cGirth variable is confounded with the intercept term: the two interaction columns add up to cGirth, which is just Girth minus a constant. You can see what's going on by removing the intercept:
> lm(Volume ~ -1 + Girth + cGirth:dHeight, data = trees)
Call:
lm(formula = Volume ~ -1 + Girth + cGirth:dHeight, data = trees)
Coefficients:
Girth cGirth:dHeightFALSE cGirth:dHeightTRUE
2.199 2.053 3.339