We're teaching a stats class for biology students and trying to use R as the computing and data visualization platform. As much as possible, we'd like to avoid using extra packages and doing anything terribly "fancy" in R; the focus of the course is on the statistics, not the programming. Nevertheless, we haven't found a very good way of generating an errorbar plot in R for a two factor ANOVA design. We're using the ggplot2 package to make the plot, and while it does have a built-in stat_summary method of generating 95% CI errorbars, the way these are calculated may not always be the right way . Below, I go through the code for the ANOVA by hand and calculate the 95% CIs by hand also (with standard error estimated from the total residual variance, not just the within-group variance ggplot's summary method would use). At the end, there's actually a plot.
So the question is... is there an easier/faster/simpler way to do all of this?
# LIZARD LENGTH DATA
island.1 <- c(0.2, 5.9, 6.1, 6.5)
island.2 <- c(5.6, 14.8, 15.5, 16.4)
island.3 <- c(0.8, 3.9, 4.3, 4.9)
sex.codes <- c("Male", "Female", "Male", "Female")
# PUTTING DATA TOGETHER IN A DATA FRAME
df.1 <- data.frame(island.1, island.2, island.3, sex.codes)
# MELTING THE DATA FRAME INTO LONG FORM
library(reshape)
df.2 <- melt(df.1)
# MEAN BY CELL
mean.island1.male <- with(df.2, mean(value[variable == "island.1" & sex.codes == "Male"]))
mean.island1.female <- with(df.2, mean(value[variable == "island.1" & sex.codes == "Female"]))
mean.island2.male <- with(df.2, mean(value[variable == "island.2" & sex.codes == "Male"]))
mean.island2.female <- with(df.2, mean(value[variable == "island.2" & sex.codes == "Female"]))
mean.island3.male <- with(df.2, mean(value[variable == "island.3" & sex.codes == "Male"]))
mean.island3.female <- with(df.2, mean(value[variable == "island.3" & sex.codes == "Female"]))
# ADDING CELL MEANS TO DATA FRAME
df.2$means[df.2$variable == "island.1" & df.2$sex.codes == "Male"] <- mean.island1.male
df.2$means[df.2$variable == "island.1" & df.2$sex.codes == "Female"] <- mean.island1.female
df.2$means[df.2$variable == "island.2" & df.2$sex.codes == "Male"] <- mean.island2.male
df.2$means[df.2$variable == "island.2" & df.2$sex.codes == "Female"] <- mean.island2.female
df.2$means[df.2$variable == "island.3" & df.2$sex.codes == "Male"] <- mean.island3.male
df.2$means[df.2$variable == "island.3" & df.2$sex.codes == "Female"] <- mean.island3.female
# LINEAR MODEL
lizard.model <- lm(value ~ variable*sex.codes, data=df.2)
# CALCULATING RESIDUALS BY HAND:
df.2$residuals.1 <- df.2$value - df.2$means
# CONFIRMING RESIDUALS FROM LINEAR MODEL:
df.2$residuals.2 <- residuals(lizard.model)
# TWO FACTOR MAIN EFFECT ANOVA
lizard.anova <- anova(lizard.model)
# INTERACTION PLOT
interaction.plot(df.2$variable, df.2$sex.codes, df.2$value)
# SAMPLE SIZE IN EACH CELL
n <- length(df.2$value[df.2$variable == "island.1" & df.2$sex.codes == "Male"])
# > n
# [1] 2
# NOTE: JUST FOR CLARITY, PRETEND n=10
n <- 10
# CALCULATING STANDARD ERROR
island.se <- sqrt(lizard.anova$M[4]/n)
# HALF CONFIDENCE INTERVAL
island.ci.half <- qt(0.95, lizard.anova$D[4]) * island.se
# MAKING SUMMARY DATA FRAME
summary.df <- data.frame(
Means = c(mean.island1.male,
mean.island1.female,
mean.island2.male,
mean.island2.female,
mean.island3.male,
mean.island3.female),
Location = c("island1",
"island1",
"island2",
"island2",
"island3",
"island3"),
Sex = c("male",
"female",
"male",
"female",
"male",
"female"),
CI.half = rep(island.ci.half, 6)
)
# > summary.df
# Means Location Sex CI.half
# 1 3.15 island1 male 2.165215
# 2 6.20 island1 female 2.165215
# 3 10.55 island2 male 2.165215
# 4 15.60 island2 female 2.165215
# 5 2.55 island3 male 2.165215
# 6 4.40 island3 female 2.165215
# GENERATING THE ERRORBAR PLOT
library(ggplot2)
qplot(data=summary.df,
y=Means,
x=Location,
group=Sex,
ymin=Means-CI.half,
ymax=Means+CI.half,
geom=c("point", "errorbar", "line"),
color=Sex,
shape=Sex,
width=0.25) + theme_bw()
I have to admit I'm quite baffled by your code. Don't take this as personal criticism, but I strongly advise you to learn your students to use the power of R as much as possible. They can only benefit from it, and my experience is that they understand faster what's going on if I don't throw lines and lines of code clutter to their heads.
First of all, you don't have to calculate the means by hand. Just do:
df.2$mean <- with(df.2,ave(value,sex.codes,variable,FUN=mean))
See also ?ave
. That is more clear than the clutter of code in your example. If you have the lizard.model, you can just use
fitted(lizard.model)
and compare these values to the means.
Then I strongly disagree with you. What you calculate, is not the standard error on your prediction. To do that correctly, use the predict()
function
outcome <- predict(lizard.model,se.fit=TRUE)
df.2$CI.half <- outcome$se / 2
To get the confidence interval on the predicted means, you have to use the correct formulae if you ever want your students to understand this correctly. Take a look at section 3.5 of this incredibly great Practical Regression and Anova using R from Faraway. It contains tons of code examples where everything is calculated by hand in a convenient and concise way. It will serve both you and your students. I learnt a lot from it and use it often as a guideline when explaining these things to students.
Now to get the summary dataframe you have a few options, but this one works and is quite understandible.
summary.df <- unique(df.2[,-c(3,5,6)])
names(summary.df) <- c('Sex','Location','Means','CI.half')
And now you can just run your plot code as it stands there.
Alternatively, if you want the prediction error on your values, you can use the following:
lizard.predict <- predict(lizard.model,interval='prediction')
df.2$lower <- lizard.predict[,2]
df.2$upper <- lizard.predict[,3]
summary.df <- unique(df.2[,-3])
names(summary.df)[1:3] <- c('Sex','Location','Means')
qplot(data=summary.df,
y=Means,
x=Location,
group=Sex,
ymin=lower,
ymax=upper,
geom=c("point", "errorbar", "line"),
color=Sex,
shape=Sex,
width=0.25) + theme_bw()
PS: If I sound harsh here and there, that's not intended. English is not my mothertongue, and I'm still not acquainted with the subtleties of the language.
Here is another attempt using the sciplot package. Alternative ways to compute the confidence intervals can be passed in parameter ci.fun.
lineplot.CI(variable,value, group =sex.codes , data = df.2, cex = 1.5,
xlab = "Location", ylab = "means", cex.lab = 1.2, x.leg = 1,
col = c("blue","red"), pch = c(16,16))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With