Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R - How to contrast code factors and retain meaningful labels in output summary

Tags:

r

r-factor

Ok everyone, once and for all, how do you (emphasis on you, because I'm sure there is more than one way to achieve this) contrast code (treatment, sum, helmert, etc.) and retain a meaningful factor label (so you can make meaningful interpretations of effects) in the glm function?

I understand I can use level() to understand which factor level is the reference, but that get's tedious when I begin to involve factors with 5 or 10 levels and their interactions.

Here is a quick two factor example of what I mean

outcome <- c(1,0,0,1,1,0,0,0,1, 0, 0, 1)
firstvar <- c("A", "B", "C", "C", "B", "B", "A", "A", "C", "A", "C", "B")
secondvar <- c("D", "D", "E", "F", "F", "E", "D", "E", "F", "F", "D", "E")
df <- as.data.frame(cbind(outcome, firstvar, secondvar))

df$firstvar <- as.factor(df$firstvar)
df$secondvar <- as.factor(df$secondvar)

#not coded manually (and default appears to be dummy or treatment coding)
#gives meaningful factor labels in summary function
summary(glm(outcome ~ firstvar*secondvar, data=df, family="binomial"))

#effects coded
#does not give meaningful factor labels
contrasts(df$firstvar)=contr.sum(3)
contrasts(df$secondvar)=contr.sum(3)
summary(glm(outcome ~ firstvar*secondvar, data=df, family="binomial"))

#dummy coded
contrasts(df$firstvar)=contr.treatment(3); 
contrasts(df$secondvar)=contr.treatment(3); 
summary(glm(outcome ~ firstvar*secondvar, data=df, family="binomial"))

Any and all suggestions will be appreciated. This problem has bugged me for a while, and I'm sure there is a simple(ish) solution.

like image 314
gh0strider18 Avatar asked Jul 01 '14 17:07

gh0strider18


People also ask

How do you code contrast in R?

In order to set a contrast in R, you can use the contr. _X_() function for treatment, sum, and Helmert contrasts, or define any contrast manually. Be aware that this changes your dataset. You might want to consider creating a new variable as a copy of your original one, and set the contrasts on that variable.

What is a Helmert contrast?

The idea behind Helmert contrasts is to compare each group to the mean of the “previous” ones. That is, the first contrast represents the difference between group 2 and group 1, the second contrast represents the difference between group 3 and the mean of groups 1 and 2, and so on.

What are contrast codes?

Contrast coding allows for recentering of categorical variables such that the intercept of a model is not the mean of one level of a category, but instead the mean of all data points in the data set.

What are contrasts in logistic regression?

Contrasts are used to test for differences among the levels of a factor. You can specify a contrast for each factor in the model. Contrasts represent linear combinations of the parameters.


1 Answers

Well, the simple answer (for contr.treatment at least), is that you should pass the factor levels to the function, rather than just the total count. In most cases this will set the level names correctly. For example

contr.treatment(levels(df$firstvar))

#   B C
# A 0 0
# B 1 0
# C 0 1

and then R uses the column names as labels/suffixes on the coefficients in the regression summary. However, even when passing labels, contr.sum doesn't like to set column names. Here we can create our own wrapper though.

named.contr.sum<-function(x, ...) {
    if (is.factor(x)) {
        x <- levels(x)
    } else if (is.numeric(x) & length(x)==1L) {
        stop("cannot create names with integer value. Pass factor levels")
    }
    x<-contr.sum(x, ...)
    colnames(x) <- apply(x,2,function(x) 
         paste(names(x[x>0]), names(x[x<0]), sep="-")
    )
    x
}

Here we are basically calling calling contr.sum and just adding column names to the result (plus some error checking). You can run this with

named.contr.sum(levels(df$firstvar))

#   A-C B-C
# A   1   0
# B   0   1
# C  -1  -1

I decided to use "A-C" and "B-C" as labels, but you could change that in the code if you like. Then running

contrasts(df$firstvar)=named.contr.sum(levels(df$firstvar))
contrasts(df$secondvar)=named.contr.sum(levels(df$secondvar))

summary(glm(outcome ~ firstvar*secondvar, data=df, family="binomial"))

will give you

Call:

glm(formula = outcome ~ firstvar * secondvar, family = "binomial", 
    data = df)

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)
(Intercept)              -6.855e+00  5.023e+03  -0.001    0.999
firstvarA-C              -6.855e+00  6.965e+03  -0.001    0.999
firstvarB-C               6.855e+00  6.965e+03   0.001    0.999
secondvarD-F             -6.855e+00  6.965e+03  -0.001    0.999
secondvarE-F             -6.855e+00  6.965e+03  -0.001    0.999
firstvarA-C:secondvarD-F  2.057e+01  8.473e+03   0.002    0.998
firstvarB-C:secondvarD-F -1.371e+01  1.033e+04  -0.001    0.999
firstvarA-C:secondvarE-F  7.072e-10  1.033e+04   0.000    1.000
firstvarB-C:secondvarE-F  6.855e+00  8.473e+03   0.001    0.999
like image 189
MrFlick Avatar answered Nov 09 '22 22:11

MrFlick