I'm attempting to run some statistical analyses on a field trial that was constructed over 2 sites over the same growing season.
At both sites (Site
, levels: HF|NW) the experimental design was a RCBD with 4 (n=4) blocks (Block
, levels: 1|2|3|4 within each Site
).
There were 4 treatments - 3 different forms of nitrogen fertiliser and a control (no nitrogen fertiliser) (Treatment
, levels: AN, U, IU, C).
During the field trial there were 3 distinct periods that commenced with fertiliser addition and ended with harvesting of the grass. These periods have been given the levels 1|2|3 under the factor N_app
.
There are a range of measurements that I would like to test the following null hypothesis H0 on:
Treatment
(H0) had no effect on measurement
Two of the measurements I am particularly interested in are: grass yield and ammonia emissions.
Starting with grass yield (Dry_tonnes_ha
) as
shown here, a nice balanced data set
The data can be downloaded in R using the following code:
library(tidyverse)
download.file('https://www.dropbox.com/s/w5ramntwdgpn0e3/HF_NW_grass_yield_data.csv?raw=1', destfile = "HF_NW_grass_yield_data.csv", method = "auto")
raw_data <- read.csv("HF_NW_grass_yield_data.csv", stringsAsFactors = FALSE)
HF_NW_grass <- raw_data %>% mutate_at(vars(Site, N_app, Block, Plot, Treatment), as.factor) %>%
mutate(Date = as.Date(Date, format = "%d/%m/%Y"),
Treatment = factor(Treatment, levels = c("AN", "U", "IU", "C")))
I have had a go at running an ANOVA on this using the following approach:
model_1 <- aov(formula = Dry_tonnes_ha ~ Treatment * N_app + Site/Block, data = HF_NW_grass, projections = TRUE)
I have a few concerns with this.
Firstly, what is the best way to test assumptions? For a simple one-way ANOVA I would use shapiro.test()
and bartlett.test()
on the dependent variable (Dry_tonnes_ha
) to assess normality and heterogeneity of variance. Can I use the same approach here?
Secondly, I am concerned that N_app
is a repeated measure as the same measurement is taken from the same plot over 3 different periods - what is the best way to build this repeated measures into the model?
Thirdly, I'm not sure of the best way to nest Block
within Site
. At both sites the levels of Block
are 1:4. Do I need to have unique Block
levels for each site?
I have another data set for NH3 emissions here. R code to download:
download.file('https://www.dropbox.com/s/0ax16x95m2z3fb5/HF_NW_NH3_emissions.csv?raw=1', destfile = "HF_NW_NH3_emissions.csv", method = "auto")
raw_data_1 <- read.csv("HF_NW_NH3_emissions.csv", stringsAsFactors = FALSE)
HF_NW_NH3 <- raw_data_1 %>% mutate_at(vars(Site, N_app, Block, Plot, Treatment), as.factor) %>%
mutate(Treatment = factor(Treatment, levels = c("AN", "U", "IU", "C")))
For this I have all the concerns above with the addition that the data set is unbalanced.
At HF
for N_app
1 n=3, but for N_app
2 & 3 n=4
At NW
n=4 for all N_app
levels.
At NF
measurements were only made on the Treatment
levels U
and IU
At NW
measuremnts were made on Treatment
levels AN
, U
and IU
I'm not sure how to deal with this added level of complexity. I am tempted to just analyse as 2 separate site (the fact that the N_app
periods are not the same at each site may encourage this approach).
Can I use a type iii sum of squares ANOVA here?
It has been suggested to me that a linear mixed modelling approach may be the way forward but I'm not familiar with using these.
I would welcome your thoughts on any of the above. Thanks for your time.
Rory
To answer your first question on the best way of testing assumptions. While your attempt of using another statistical test, implemented in R, is reasonable, I would actually just visualize the distribution and see if the data meet ANOVA assumptions. This approach may seem somewhat subjective, but it does work in most cases.
aov
despite the slightly bimodal distribution. (It appears that log-transformation help further meet normality assumption. This is something you may consider, especially for downstream analyses.)
par(mfrow=c(2,2))
plot(density(HF_NW_grass$Dry_tonnes_ha), col="red", main="Density")
qqnorm(HF_NW_grass$Dry_tonnes_ha, col="red", main="qqplot")
qqline(HF_NW_grass$Dry_tonnes_ha)
DTH_trans <- log10(HF_NW_grass$Dry_tonnes_ha)
plot(density(DTH_trans), col="blue", main="transformed density")
qqnorm(DTH_trans, col="blue", main="transformed density")
qqline(DTH_trans)
Regarding your second question on what the best way to build repeated measures into the model is: Unfortunately, it is difficult to pinpoint such a "best" model, but based on my knowledge (mostly through genomics big data), you may want to use a linear mixed effect model. This can be implemented through the lme4
R package, for example. Since it appears you already know how to construct a linear model in R, you should have no problem with applying lme4
functions.
Your third question regarding whether to nest two variables is tricky. If I were you, I would start with Site
and Block
as if they were independent factors. However, if you know they are not independent, you should probably nest them.
I think your questions and concerns are quite open-ended. My recommendation is that as long as you have a plausible justification, go ahead and proceed.
I agree with @David C on the use of visual diagnostics. Simple QQ plots should work
# dependent variable.
par(mfrow=c(1,2))
qqnorm(dt[,dry_tonnes_ha]); qqline(dt[,dry_tonnes_ha], probs= c(0.15, 0.85))
qqnorm(log(dt[,dry_tonnes_ha])); qqline(log(dt[,dry_tonnes_ha]), probs= c(0.15, 0.85))
The log transformation looks reasonable to me. You can also see this from the density plot, which is long tailed and somewhat bi-modal
par(mfrow=c(1,1))
plot(density(dt[,dry_tonnes_ha]))
You could alternatively use lineup plots (Buja et al, 2009) if you wish. I'm not sure they're needed in this case. Vignette provided
library(nullabor)
# this may not be the best X variable. I'm not familiar with your data
dt_l <- lineup(null_permute("dry_tonnes_ha"), dt)
qplot(dry_tonnes_ha, treatment, data = dt_l) + facet_wrap(~ .sample)
For the other assumptions, you can just use the standard diagnostic plots from the lm
lm2 <- lm(log(dry_tonnes_ha) ~ treatment * n_app + site/block, data = dt)
plot(lm2)
I don't see anything too troublesome in these plots.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With