Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

aov() error term in R: what's the difference bw Error(id) and Error(id/timevar) specification?

Tags:

r

anova

What is the difference between the aov(depvar~timevar+Error(id)) and the aov(depvar~timevar+Error(id/timevar)) formula specifications? These two variants produce slightly different results.

The same question was once asked here: https://stats.stackexchange.com/questions/60108/how-to-write-the-error-term-in-repeated-measures-anova-in-r However, I'd like to repeat it with a more appropriate example.

Here is an example that I created:

var=rep(NA,180)
id=rep(1:20,each=180/20)
group=rep(rep(1:2,each=9),180/(9*2))
time1=rep(rep(1:3,each=3),180/(3*3))
time2=rep(c(8,15,20),180/3)

var[group==1&time1==1&time2==8]=runif(10,105,115)
var[group==2&time1==1&time2==8]=runif(10,105,115)
var[group==1&time1==1&time2==15]=runif(10,95,105)
var[group==2&time1==1&time2==15]=runif(10,95,105)
var[group==1&time1==1&time2==20]=runif(10,85,95)
var[group==2&time1==1&time2==20]=runif(10,85,95)

var[group==1&time1==2&time2==8]=runif(10,95,105)
var[group==2&time1==2&time2==8]=runif(10,95,105)
var[group==1&time1==2&time2==15]=runif(10,85,95)
var[group==2&time1==2&time2==15]=runif(10,75,85)
var[group==1&time1==2&time2==20]=runif(10,75,85)
var[group==2&time1==2&time2==20]=runif(10,65,75)

var[group==1&time1==3&time2==8]=runif(10,95,105)
var[group==2&time1==3&time2==8]=runif(10,95,105)
var[group==1&time1==3&time2==15]=runif(10,85,95)
var[group==2&time1==3&time2==15]=runif(10,75,85)
var[group==1&time1==3&time2==20]=runif(10,75,85)
var[group==2&time1==3&time2==20]=runif(10,65,75)

df=data.frame(id,var,group,time1,time2)
df$id=factor(df$id)
df$group=factor(df$group)
df$time1=factor(df$time1)
df$time2=factor(df$time2)

Performing aov() on this gets slightly different results depending on Error() term specification:

Just for one time term:

> summary(aov(var~time1+Error(id),data=df))
Error: id
      Df Sum Sq Mean Sq F value Pr(>F)
      Residuals 19  958.4   50.44               
Error: Within
       Df Sum Sq Mean Sq F value   Pr(>F)    
       time1       2   7538    3769   30.41 6.72e-12 ***
       Residuals 158  19584     124         

> summary(aov(var~time1+Error(id/time1),data=df))
Error: id
      Df Sum Sq Mean Sq F value Pr(>F)
      Residuals 19  958.4   50.44               
Error: id:time1
      Df Sum Sq Mean Sq F value Pr(>F)    
      time1      2   7538    3769   211.5 <2e-16 ***
      Residuals 38    677      18                   
      ---
     Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Error: Within
       Df Sum Sq Mean Sq F value Pr(>F)
       Residuals 120  18907   157.6    

Or for both time terms (don't type output here for the sake of space, your may check it on your own):

summary(aov(var~group*time1*time2+Error(id/(group*time1*time2)),data=df)) 
summary(aov(var~group*time1*time2+Error(id),data=df)) 

Why does it happen? Which variant is correct?

like image 399
NeverTim Avatar asked May 28 '16 10:05

NeverTim


2 Answers

Here's a blog post that'll help break down what each means under the "Random Effects in Classical ANOVA" section.

From the blog, here's a summary for what "dividing" in the Error term means.

aov(Y ~ Error(A), data=d)               # Lone random effect
aov(Y ~ B + Error(A/B), data=d)         # A random, B fixed, B nested within A
aov(Y ~ (B*X) + Error(A/(B*X)), data=d) # B and X interact within levels of A

So from your question,

aov(depvar~timevar+Error(id/timevar))

means you have a random effect from id but then fix timevar with timevar nested within id levels versus

aov(depvar~timevar+Error(id))

which is just taking id as the random effects with no constraint on other variables.

Source: http://conjugateprior.org/2013/01/formulae-in-r-anova/

This might prove useful as well, which is some code going over analysis of variance that has some recommendations on learning ANOVA.

like image 133
Eric Leung Avatar answered Sep 19 '22 08:09

Eric Leung


The difference between aov(depvar~timevar+Error(id)) and aov(depvar~timevar+Error(id/timevar)) is whether or not you include timevar as a random effect.

Note that there's more than one way to include a variable as a random effect. You could also use aov(depvar~timevar+Error(id*timevar)) or aov(depvar~timevar+Error(id + timevar)) as well. Each of these means something quite different, but it can be confusing because they'll often give you similar results when applied to the same dataset, due to the constraints of the data themselves.

The slash / used in aov() denotes nesting. When you use /, R automatically expands it to the main effect of the bottom variable plus the interaction between the bottom and the top. For example, A/B automatically expands to A + A:B. This is similar to how A*B automatically expands to A + B + A:B, but with nesting, the variable in the nest never appears outside of its nest (i.e. there can be no main effect of B on its own).

You can see this expansion happening in your output:

> summary(aov(var~time1+Error(id / time1)))

Error: id
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals  1  52.24   52.24               

Error: id:time1
      Df Sum Sq Mean Sq
time1  1   4291    4291

Error: Within
           Df Sum Sq Mean Sq F value  Pr(>F)   
time1       1   1239  1238.7   10.19 0.00167 **
Residuals 176  21399   121.6                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The Error terms denote random effects. Notice you get one for the main effect of id because it's the base of the nest, and one for the interaction between id and time1, because time1 is nested within id (you also get an Error effect for Within which is the basic residual term for the model, i.e. the random effect of the individual observations themselves).

So what's the correct approach for your data?

It depends on 1) how your data are actually structured and 2) what model you intend to run. Note: There's no definitive test you can run on the data to determine the structure or the correct model; this is a thinking exercise rather than a computational one.

In the example models you provided, you have an outcome var, and then what appear to be grouping variables group and id, and then two time variables time1 and time2. Each id is only in 1 group, not across both groups, suggesting that id is nested within group.

> table(group, id)
     id
group 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
    1 9 0 9 0 9 0 9 0 9  0  9  0  9  0  9  0  9  0  9  0
    2 0 9 0 9 0 9 0 9 0  9  0  9  0  9  0  9  0  9  0  9

I'll assume that id refers to a single participant, and the 9 measurements on time1 and time2 are within-subjects tests on each participant (i.e. each participant was measured 9 times on var, so this is a repeated measures design).

To make it concrete, let's say var is a score on some problem solving task, and time1 and time2 are the minutes participants are allowed to study the problem and the amount of time they're given to complete the problem, respectively. Since time1 and time2 are crossed, each participant completes the task 9 times, under each combination of circumstances.

> table(time1, time2)
     time2
time1  8 15 20
    1 20 20 20
    2 20 20 20
    3 20 20 20
> table(time1, time2, id)
, , id = 1

     time2
time1 8 15 20
    1 1  1  1
    2 1  1  1
    3 1  1  1

, , id = 2

     time2
time1 8 15 20
    1 1  1  1
    2 1  1  1
    3 1  1  1
(output truncated)

Participants are tested in groups, with half of the participants in group 1 and the other half in group 2. Perhaps the study was run in classrooms, and group 1 is one class and group 2 is the second class. Probably, group identity is not actually a variable of interest, but we shouldn't leave it out of the model because there may be some nuisance variance resulting from differences between the groups. For example, maybe the first classroom had better lighting, giving all of the members of group 1 and better chance at scoring well on the puzzles than the members of group 2.

Scores, ID, and Group should all be random effects, and time1 and time2 should be fixed effects (note this could vary for the same data if you had different thoughts in the model; e.g. you may want to consider group as fixed depending on your research question).

Given that model, this would be the most complete version of the model, using aov():

aov(var~time1*time2 + Error(group/id/(time1*time2)),data=df)

Here's the output:

> summary(aov(var~time1*time2 + Error(group/id/(time1*time2)),data=df))

Error: group
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals  1  771.7   771.7               

Error: group:id
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals 18  243.8   13.55               

Error: group:id:time1
          Df Sum Sq Mean Sq F value Pr(>F)    
time1      2   7141    3571   181.6 <2e-16 ***
Residuals 38    747      20                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Error: group:id:time2
          Df Sum Sq Mean Sq F value Pr(>F)    
time2      2  16353    8176   434.6 <2e-16 ***
Residuals 38    715      19                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Error: group:id:time1:time2
            Df Sum Sq Mean Sq F value  Pr(>F)   
time1:time2  4  214.5   53.63   5.131 0.00103 **
Residuals   76  794.3   10.45                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Warning message:
In aov(var ~ time1 * time2 + Error(group/id/(time1 * time2)), data = df) :
  Error() model is singular

(Along with the links above, here is some additional guidance on random vs. fixed effects)

like image 29
Rose Hartman Avatar answered Sep 22 '22 08:09

Rose Hartman