Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extracting model coefficients for formula terms

Tags:

r

glm

If I fit an lm or glm with factors as explanatory variables, I get a model with estimates and SEs for each level of the factor (minus the baseline level). How can I reliably extract that info for estimates relating to each term in the formula?

Example:

> hec = reshape2::melt(HairEyeColor)
> head(hec)
   Hair   Eye  Sex value
1 Black Brown Male    32
2 Brown Brown Male    53
3   Red Brown Male    10

Then fit a model with two factors:

m = glm(value ~ -1 + Hair + Eye, data=hec)

The fitted coefficients are available in a named vector (and as a full table with these rownames via summary(m)$coef):

> coef(m)
HairBlack HairBrown   HairRed HairBlond   EyeBlue  EyeHazel  EyeGreen 
   22.500    44.750    17.875    24.875    -0.625   -15.875   -19.500 

but now I want to get the first four out since they relate to Hair, and the next three as they relate to Eye.

I could do this using string pattern matching (which I have been doing for specific examples, but I'm doing this often enough I want to get past this) but I am hoping there's a package or a function or an easy way to do this without recourse to string matching, which can possibly break if you construct your data perversely enough and get identical matching coefficient names...

(In the following example, both foo and foob terms generate a coefficient named "foobaz".)

> d = data.frame(foo=sample(c("bar","baz","qux","fnord"),100,TRUE), foob=sample(c("ar","az","sqq","moo"),100,TRUE), y=rnorm(100))
> coef(glm(y~foo+foob, data=d))
(Intercept)      foobaz    foofnord      fooqux      foobaz     foobmoo 
-0.37880701  0.40920067  0.10645248  0.19536840  0.21929754 -0.05912854 
    foobsqq 
 0.37137127 

This is also fragile for interaction terms, where the order of interacted factors depends on the order the terms appear in the formula. You could have X:Y in your formula and get a term out with Y:X if Y has already appeared in the formula. Great.

I've looked at the fitted model object for clues as to how outputs relate to formula terms, and I think something could be done using the $terms component, then using that to see how many levels are in each factor, and getting that many elements from the coefficients at a time, but also that requires compensating for intercept terms, and then interaction terms add another level of complexity.

So before I code this all up, has it been done?

like image 968
Spacedman Avatar asked Mar 26 '26 02:03

Spacedman


2 Answers

Example:

d <- data.frame(foo = sample(c("bar","baz","qux","fnord"), 100, TRUE),
                foob = sample(c("ar","az","sqq","moo"), 100, TRUE),
                y = rnorm(100))
mod <- glm(y ~ foo + foob, data = d)

model.matrix() is a function that generates a design matrix for a model. Apart from this, it also contains an attribute "assign", an integer vector with an entry for each column in the matrix giving the term in the formula which gave rise to the column.

attr(model.matrix(mod), "assign")
# [1] 0 1 1 1 2 2 2

Value 0 corresponds to the intercept (if any), and positive values to terms in the order given by the "term.labels" attribute of the terms structure corresponding to the model.

attr(terms(mod),  "term.labels") # or labels(terms(mod))
# [1] "foo"  "foob"

By combining the above, we can write a function that extracts the coefficients of specific term(s) from a model.

getCoef <- function(term, model) {
  ind <- attr(model.matrix(model), "assign")
  lab <- labels(terms(model))
  coef(model)[ind %in% match(term, lab)]
}
getCoef("foo", mod)
#      foobaz    foofnord      fooqux
#  0.12634573  0.27871612 -0.04792015

getCoef("foob", mod)
#     foobaz    foobmoo    foobsqq
# -0.1596896 -0.1452948 -0.0532562

getCoef(c("foo", "foob"), mod)
#      foobaz    foofnord      fooqux      foobaz     foobmoo     foobsqq
#  0.12634573  0.27871612 -0.04792015 -0.15968959 -0.14529477 -0.05325620
# (the two foobaz's from `foo` and `foob` are all extracted)

If a nonexistent term is provided, ...

getCoef("bar", mod)
# named numeric(0)
like image 181
Darren Tsai Avatar answered Mar 28 '26 17:03

Darren Tsai


Try dummy.coef . It will give a named list, one component per factor, with class "dummy_coef" having its own print method. It will include all levels of the factors including ones omitted by coef. When using treatment contrasts, as in the example, the extra level will always be zero. In this example the Brown level of Eye has no coef component but dum$Eye does have a zero dummy.coef component.

hec <- reshape2::melt(HairEyeColor)
m <- glm(value ~ -1 + Hair + Eye, data=hec)
dum <- dummy.coef(m); dum

## Full coefficients are 
##                                         
## Hair:      Black   Brown     Red   Blond
##           22.500  44.750  17.875  24.875
## Eye:       Brown    Blue   Hazel   Green
##           0.000  -0.625 -15.875 -19.500

dum$Hair
##  Black  Brown    Red  Blond 
## 22.500 44.750 17.875 24.875 

dum$Eye
##   Brown    Blue   Hazel   Green 
##   0.000  -0.625 -15.875 -19.500 
like image 27
G. Grothendieck Avatar answered Mar 28 '26 18:03

G. Grothendieck