How to properly set contrasts in R

I have been asked to see if there is a linear trend in 3 groups of data (5 points each) by using ANOVA and linear contrasts. The 3 groups represent data collected in 2010, 2011 and 2012. I want to use R for this procedure and I have tried both of the following:

contrasts(data$groups, how.many=1) <- contr.poly(3)
contrasts(data$groups)  <- contr.poly(3)

Both ways seem to work fine but give slightly different answers in terms of their p-values. I have no idea which is correct and it is really tricky to find help for this on the web. I would like help figuring out what is the reasoning behind the different answers. I'm not sure if it has something to do with partitioning sums of squares or whatnot.

Asked Feb 14 '14 by Jimj

2 Answers

The two approaches differ with respect to whether a quadratic polynomial is used.

For illustration purposes, have a look at this example; both x and y are factors with three levels.

x <- y <- gl(3, 2)
# [1] 1 1 2 2 3 3
# Levels: 1 2 3

The first approach creates a contrast matrix for a quadratic polynomial, i.e., with a linear (.L) and a quadratic trend (.Q). The argument 3 means: create polynomial contrasts for three levels, i.e., up to degree 3 - 1 = 2.

contrasts(x) <- contr.poly(3)
# [1] 1 1 2 2 3 3
# attr(,"contrasts")
#              .L         .Q
# 1 -7.071068e-01  0.4082483
# 2 -7.850462e-17 -0.8164966
# 3  7.071068e-01  0.4082483
# Levels: 1 2 3

In contrast, the second approach results in a polynomial of first order (i.e., a linear trend only). This is due to the argument how.many = 1. Hence, only 1 contrast is created.

contrasts(y, how.many = 1) <- contr.poly(3)
# [1] 1 1 2 2 3 3
# attr(,"contrasts")
#              .L
# 1 -7.071068e-01
# 2 -7.850462e-17
# 3  7.071068e-01
# Levels: 1 2 3

If you're interested in the linear trend only, the second option seems more appropriate for you.
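To see why the two calls from the question give slightly different p-values, here is a small sketch with made-up data; score and year are hypothetical stand-ins for the asker's 3 groups of 5 points:

```r
# Made-up data: 3 groups (years) of 5 points each, as in the question.
set.seed(1)
score <- c(rnorm(5, mean = 10), rnorm(5, mean = 12), rnorm(5, mean = 13))
year  <- gl(3, 5, labels = c("2010", "2011", "2012"))

# Full polynomial contrasts: the model spends 2 df on .L and .Q.
contrasts(year) <- contr.poly(3)
m_full <- lm(score ~ year)

# Linear contrast only: 1 df; the quadratic variation is pushed into
# the residual, which changes the error term and hence the p-values.
contrasts(year, how.many = 1) <- contr.poly(3)
m_lin <- lm(score ~ year)

summary(m_full)  # reports year.L and year.Q
summary(m_lin)   # reports year.L only
```

The two summaries report slightly different p-values for the linear term, for exactly the reason described above: the residual degrees of freedom differ.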

Answered Sep 23 '22 by Sven Hohenstein

Changing the contrasts you request changes the degrees of freedom of the model. If one model requests linear and quadratic contrasts, and a second specifies only, say, the linear contrast, then the second model has an extra residual degree of freedom: this increases the power to test the linear hypothesis (at the cost of preventing the model from fitting the quadratic trend).

Using the full ("nlevels - 1") set of contrasts creates an orthogonal set of contrasts which explores the full set of (independent) response configurations. Cutting back to just one prevents the model from fitting one configuration (in this case the quadratic component, which our data in fact possess).
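You can check this orthogonality directly; the columns returned by contr.poly() are orthonormal:

```r
# t(cp) %*% cp is the identity matrix: each contrast has unit length
# and any two are mutually orthogonal, so the full set partitions the
# between-groups sum of squares into independent linear and quadratic
# pieces.
cp <- contr.poly(3)
round(crossprod(cp), 10)
```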

To see how this works, use the built-in dataset mtcars, and test the (confounded) relationship of gears to gallons. We'll hypothesize that the more gears the better (at least up to some point).

df = mtcars # copy the dataset
df$gear = as.ordered(df$gear) # make an ordered factor

Ordered factors default to polynomial contrasts, but we'll set them here to be explicit:

contrasts(df$gear) <- contr.poly(nlevels(df$gear))

Then we can model the relationship.

m1 = lm(mpg ~ gear, data = df)
summary.lm(m1)
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  20.6733     0.9284  22.267  < 2e-16 ***
# gear.L        3.7288     1.7191   2.169  0.03842 *  
# gear.Q       -4.7275     1.4888  -3.175  0.00353 ** 
# 
# Multiple R-squared:  0.4292,  Adjusted R-squared:  0.3898 
# F-statistic:  10.9 on 2 and 29 DF,  p-value: 0.0002948

Note we have F(2,29) = 10.9 for the overall model and p=.038 for our linear effect with an estimated extra 3.7 mpg/gear.

Now let's only request the linear contrast, and run the "same" analysis.

contrasts(df$gear, how.many = 1) <- contr.poly(nlevels(df$gear))
m1 = lm(mpg ~ gear, data = df)
summary.lm(m1)
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   21.317      1.034  20.612   <2e-16 ***
# gear.L         5.548      1.850   2.999   0.0054 ** 
# Multiple R-squared:  0.2307,  Adjusted R-squared:  0.205 
# F-statistic: 8.995 on 1 and 30 DF,  p-value: 0.005401

The linear effect of gear is now bigger (5.5 mpg) and p << .05. A win? Except that the overall model fit is now significantly worse: the variance accounted for drops to just 23% (from 43%)! Why is clear if we plot the relationship:

plot(mpg ~ gear, data = df) # view the relationship

(Plot: mpg by gear, the confounded relationship of gears to gallons, showing a clear quadratic component.)

So, if you're interested in the linear trend, but also expect (or are unclear about) additional levels of complexity, you should also test these higher polynomials: the quadratic or, in general, trends up to levels - 1.
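One way to make that check formal is a nested-model comparison with anova(); here is a sketch reusing the mtcars example from above:

```r
# Fit the linear-only model and the full polynomial model, then ask
# whether the extra quadratic term earns its keep.
df <- mtcars
df$gear <- as.ordered(df$gear)

# Full polynomial contrasts (linear + quadratic).
contrasts(df$gear) <- contr.poly(nlevels(df$gear))
m_full <- lm(mpg ~ gear, data = df)

# Linear contrast only.
contrasts(df$gear, how.many = 1) <- contr.poly(nlevels(df$gear))
m_lin <- lm(mpg ~ gear, data = df)

# A significant F-test here says the quadratic term adds real fit,
# so the linear-only model is leaving structure in the residuals.
anova(m_lin, m_full)
```

For these data the comparison is significant, matching the gear.Q coefficient in the full model's summary.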

Note too that in this example the physical mechanism is confounded: We've forgotten that number of gears is confounded with automatic vs manual transmission, and also with weight, and sedan vs sports car.

If someone wants to test the hypothesis that 4 gears is better than 3, they could answer this question :-)

Answered Sep 24 '22 by tim