Controlling the value (TRUE or FALSE) of dummy variables in interaction terms when using lm()

Tags: r

When I estimate a model that has an interaction between two variables that don't enter the model as standalone variables, and when one of these variables is a dummy (class "logical") variable, R "flips the sign" of the dummy variable. That is, it reports an estimate of the coefficient on the interaction term when the dummy is FALSE, not when it is TRUE. Here is an example:

data(trees)
trees$dHeight <- trees$Height > 76
trees$cGirth  <- trees$Girth - mean(trees$Girth)
lm(Volume ~ Girth +  Girth:dHeight, data = trees)  # estimate is for  Girth:dHeightTRUE
lm(Volume ~ Girth + cGirth:dHeight, data = trees)  # estimate is for cGirth:dHeightFALSE    

Why does the regression in the last line produce an estimate for an interaction in which dHeight is FALSE rather than TRUE? (I would like R to report the estimate when dHeight is TRUE.)

This is not a big problem, but I would like to better understand why R is doing what it's doing. I know about relevel() and contrasts(), but I can't see that they would make a difference here.

user697473 asked Oct 03 '22 12:10


2 Answers

dHeight is logical. Within the model it is coerced to a factor, and the levels are sorted lexicographically (i.e. FALSE comes before TRUE).

As noted in Hong Ooi's answer, you can't estimate all 4 parameters, so R fits the terms in the order they appear (FALSE before TRUE) and drops the last one.

If you want to force R to fit the TRUE value first, you could fit the model to !dHeight:

lm(formula = Volume ~ Girth + cGirth:!dHeight, data = trees)

Note that !dHeightFALSE is equivalent to dHeightTRUE.

You will also note that in this simple case you are simply changing the sign on the coefficient so it doesn't really matter which model you fit.


EDIT: A FAR BETTER APPROACH

R can recognize that cGirth and Girth are collinear when both appear as main effects, so we can use the nesting operator, remembering that a/b expands to a + a:b:

lm(formula = Volume ~ Girth + cGirth/dHeight, data = trees)
Coefficients:
       (Intercept)               Girth              cGirth  cGirth:dHeightTRUE  
           -27.198               4.251                  NA               1.286

This gives coefficients with easy-to-interpret names, and R sensibly declines to return a coefficient for cGirth (it is reported as NA).
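As a rough check of the earlier remark that only the sign changes, the sketch below fits both parameterizations on the same trees data and compares them. The object names (fit_false, fit_true, b_false, b_true) are made up for illustration; the comparison assumes the two fits span the same column space, which follows from cGirth being Girth minus a constant.

```r
data(trees)
trees$dHeight <- trees$Height > 76
trees$cGirth  <- trees$Girth - mean(trees$Girth)

# Original formula: reports cGirth:dHeightFALSE, drops cGirth:dHeightTRUE
fit_false <- lm(Volume ~ Girth + cGirth:dHeight, data = trees)
# Nested formula: reports cGirth:dHeightTRUE, drops cGirth
fit_true  <- lm(Volume ~ Girth + cGirth/dHeight, data = trees)

b_false <- coef(fit_false)[["cGirth:dHeightFALSE"]]
b_true  <- coef(fit_true)[["cGirth:dHeightTRUE"]]

# The interaction coefficients should differ only in sign
same_magnitude <- isTRUE(all.equal(b_false, -b_true))
# Both fits should give the same fitted values
same_fit <- isTRUE(all.equal(fitted(fit_false), fitted(fit_true)))

print(c(same_magnitude = same_magnitude, same_fit = same_fit))
```

Note that the Girth coefficient is not the same in the two fits: it absorbs the change of reference level.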


R can tell that Girth and cGirth are collinear when they are both in the model as "main effect" (standalone) terms.

There is no way R could be expected to tell, when fitting Girth + cGirth:dHeight, that cGirth and Girth are collinear and that, because dHeight is logical, you want cGirth:dHeightTRUE to be the coefficient that is fitted. (You could write your own formula parser to do this if you really wanted.)

Another approach that fits the model you wanted, without any collinear terms, is to use

lm(formula = Volume ~ Girth + I(cGirth*dHeight), data = trees)

which coerces dHeight to numeric (TRUE becomes 1).
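A small sketch, using the same trees setup, of how the I() version relates to the nested fit: the product column is just cGirth where dHeight is TRUE and 0 elsewhere, so its coefficient should match cGirth:dHeightTRUE. The names fit_num and fit_nested are illustrative only, and the product coefficient is picked by position rather than by name.

```r
data(trees)
trees$dHeight <- trees$Height > 76
trees$cGirth  <- trees$Girth - mean(trees$Girth)

# Numeric product: TRUE/FALSE are coerced to 1/0 inside I()
fit_num    <- lm(Volume ~ Girth + I(cGirth * dHeight), data = trees)
# Nested formula from the approach above
fit_nested <- lm(Volume ~ Girth + cGirth/dHeight, data = trees)

b_num    <- unname(coef(fit_num)[3])                     # third term is the product
b_nested <- coef(fit_nested)[["cGirth:dHeightTRUE"]]

print(all.equal(b_num, b_nested))   # should be TRUE
print(anyNA(coef(fit_num)))         # no aliased terms, so no NA coefficients
```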


Edit to labor the point.

When you fit ~Girth + Girth:dHeight, what you are saying is that there is a main effect for Girth plus adjustments for dHeight. R considers the first level of a factor the reference level. The slope for dHeightFALSE is simply the value for Girth; you then have the adjustment for when dHeight == TRUE (Girth:dHeightTRUE).

When you fit ~Girth + cGirth:dHeight, R does not have a mind-reading parser that can tell in advance that cGirth and Girth are collinear. It expands the interaction over both levels of dHeight, finds that the last column (cGirth:dHeightTRUE) cannot be estimated, and drops it, so the second level of dHeight effectively becomes the reference level.

Imagine you had a variable that was totally unrelated to Girth, e.g.

set.seed(1)
trees$cG <- runif(nrow(trees))

Then when you fit Girth + cG:dHeight, you will get 4 parameters estimated

lm(formula = Volume ~ Girth + cG:dHeight, data = trees)

Call:
lm(formula = Volume ~ Girth + cG:dHeight, data = trees)

Coefficients:
    (Intercept)            Girth  cG:dHeightFALSE   cG:dHeightTRUE  
      -31.79645          4.79435         -5.92168          0.09578  

Which is sensible.

When R processes Girth + cGirth:dHeight, it expands the formula (with the first level of the factor first) to 1 + Girth + cGirth:dHeightFALSE + cGirth:dHeightTRUE, works out that it can't estimate all 4 parameters, and estimates the first 3.
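The expansion described above can be inspected directly. This is a minimal sketch using model.matrix() and qr() from base R; the variable names X and fit are illustrative.

```r
data(trees)
trees$dHeight <- trees$Height > 76
trees$cGirth  <- trees$Girth - mean(trees$Girth)

# Design matrix that lm() builds for this formula
X <- model.matrix(~ Girth + cGirth:dHeight, data = trees)
print(colnames(X))   # FALSE column comes before TRUE column
print(ncol(X))       # number of columns requested
print(qr(X)$rank)    # number of columns that are linearly independent

# lm() keeps the leading independent columns and reports NA for the rest
fit <- lm(Volume ~ Girth + cGirth:dHeight, data = trees)
print(names(coef(fit))[is.na(coef(fit))])
```

If the rank is one less than the number of columns, exactly one coefficient is dropped, and because of the column order it is the last one.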

mnel answered Oct 07 '22 22:10


R isn't flipping the sign on the dummy variable as such. When you fit ~ Girth + cGirth:dHeight, the cGirth variable is confounded with the intercept term (cGirth is just Girth minus a constant). You can see what's going on by removing the intercept:

> lm(Volume ~ -1 + Girth + cGirth:dHeight, data = trees)

Call:
lm(formula = Volume ~ -1 + Girth + cGirth:dHeight, data = trees)

Coefficients:
              Girth  cGirth:dHeightFALSE   cGirth:dHeightTRUE  
              2.199                2.053                3.339  
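As a hedged illustration of why the no-intercept fit is informative: it should describe the same fitted model as the version with an intercept, and the difference between its two interaction slopes should equal the cGirth:dHeightTRUE adjustment from the nested formula Girth + cGirth/dHeight. The object names below are made up for this sketch.

```r
data(trees)
trees$dHeight <- trees$Height > 76
trees$cGirth  <- trees$Girth - mean(trees$Girth)

fit_int    <- lm(Volume ~ Girth + cGirth:dHeight, data = trees)
fit_noint  <- lm(Volume ~ -1 + Girth + cGirth:dHeight, data = trees)
fit_nested <- lm(Volume ~ Girth + cGirth/dHeight, data = trees)

# Same column space, so the fitted values should agree
print(all.equal(fitted(fit_int), fitted(fit_noint)))

# Slope for TRUE minus slope for FALSE is the TRUE-vs-FALSE adjustment
slope_diff <- coef(fit_noint)[["cGirth:dHeightTRUE"]] -
              coef(fit_noint)[["cGirth:dHeightFALSE"]]
print(all.equal(slope_diff, coef(fit_nested)[["cGirth:dHeightTRUE"]]))
```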

Hong Ooi answered Oct 07 '22 23:10