I have a dataframe containing different groups, years and their values, for example:
data <- data.frame(
  group = c(rep('A', 120), rep('B', 120)),
  year = rep(c(rep('2013-2014', 40), rep('2014-2015', 40), rep('2015-2016', 40)), 2),
  value = rnorm(240)
)
For each year within each group I want to run a t-test to see whether the values are significantly different from the previous year's (so far I have been using t.test(x, y, var.equal = TRUE) to do this as a one-off).
I would like to return a dataframe with the p-values or, preferably, the significance stars generated by gtools::stars.pval(), so something like the following:
group  year       significance
A      2013-2014  NA
A      2014-2015  **
A      2015-2016  ***
B      2013-2014  NA
B      2014-2015
B      2015-2016
In the above, the p-value for the difference between 2014-2015 and 2013-2014 for group A is between 0.001 and 0.01, and the p-value for the difference between 2015-2016 and 2014-2015 for A is below 0.001. There is no evidence of a significant difference between any years for B.
There is no guarantee that every group has the same number of years.
What is the best and quickest way of doing this? I was hoping I could do it with dplyr, using group_by on group and year.
Another option is to summarise the data frame, storing all the values in one cell as a list (yes, you can do that - data frames can have nested lists inside!)
Using dplyr:
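library(dplyr)
df <- as_tibble(data)  # tbl_df() is deprecated
# after summarise() the result is still grouped by group, so lag() below stays within each group
df <- arrange(df, group, year) %>% group_by(group, year) %>% summarise(values = list(value))
df <- mutate(df, prev_values = lag(values))
df <- group_by(df, group, year)
# drop each group's first year: its lagged entry is NA or NULL depending on the dplyr version
df <- filter(df, !is.null(prev_values[[1]]), !anyNA(prev_values[[1]]))
df <- mutate(df, p_value = t.test(unlist(values), unlist(prev_values), var.equal = TRUE)$p.value) %>% print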
  group      year    values prev_values   p_value
1     A 2014-2015 <dbl[40]>   <dbl[40]> 0.7894477
2     A 2015-2016 <dbl[40]>   <dbl[40]> 0.2385581
3     B 2014-2015 <dbl[40]>   <dbl[40]> 0.3084138
4     B 2015-2016 <dbl[40]>   <dbl[40]> 0.2557849
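The output above drops the first year of each group and stops at raw p-values, whereas the question asked for NA in those rows and for significance stars. A minimal base R sketch of that last step follows. The seed is arbitrary, compare_years() is a hypothetical helper name (not part of any package), and the star thresholds are assumed to follow the usual 0.001 / 0.01 / 0.05 / 0.1 convention; gtools::stars.pval() could be swapped in for the cut() call.

```r
# Base R sketch: one row per group and year, NA for the first year of a group.
set.seed(1)  # arbitrary seed, only so the example is reproducible
data <- data.frame(
  group = c(rep("A", 120), rep("B", 120)),
  year  = rep(rep(c("2013-2014", "2014-2015", "2015-2016"), each = 40), 2),
  value = rnorm(240)
)

# compare_years() is a hypothetical helper: it tests each year of one group
# against the preceding year and leaves the first year as NA.
compare_years <- function(d) {
  years <- sort(unique(as.character(d$year)))
  p <- rep(NA_real_, length(years))
  for (i in seq_along(years)[-1]) {
    x <- d$value[d$year == years[i]]
    y <- d$value[d$year == years[i - 1]]
    p[i] <- t.test(x, y, var.equal = TRUE)$p.value
  }
  data.frame(group = as.character(d$group[1]), year = years, p_value = p,
             stringsAsFactors = FALSE)
}

# split() handles groups with different numbers of years
result <- do.call(rbind, lapply(split(data, data$group, drop = TRUE), compare_years))
rownames(result) <- NULL

# Map p-values to stars; NA p-values stay NA.
result$significance <- as.character(cut(
  result$p_value,
  breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
  labels = c("***", "**", "*", ".", " "),
  include.lowest = TRUE
))
print(result)
```

Because the years are sorted as strings, this relies on labels such as '2013-2014' sorting chronologically, which holds for the format used in the question.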
I really liked @MaksimGayduk's solution, especially the "trick" with summarise(values=list(value)). I haven't used that before and it seems very useful. My alternative, but similar, solution is based on the dplyr and broom packages.
The differences are that (a) I first create a table with the appropriate info for the t-tests of interest and then pull the corresponding values from the initial df data frame, and (b) the broom package returns all the info from the t.test output as a data frame, from which you can pick p.value or anything else you need.
set.seed(15)
df <- data.frame(
  group = c(rep('A', 120), rep('B', 120)),
  year = rep(c(rep('2013-2014', 40), rep('2014-2015', 40), rep('2015-2016', 40)), 2),
  value = rnorm(240)
)

library(dplyr)
library(broom)

df %>%
  # one row per (group, year), each paired with the preceding year of the same group
  select(group, year) %>%
  arrange(group, year) %>%
  distinct() %>%
  group_by(group) %>%
  mutate(lag_year = lag(year)) %>%
  filter(!is.na(lag_year)) %>%
  # run one t-test per row, pulling the values from the original df
  group_by(group, year, lag_year) %>%
  do(tidy(t.test(df$value[df$year == .$year & df$group == .$group],
                 df$value[df$year == .$lag_year & df$group == .$group])))
# Source: local data frame [4 x 11]
# Groups: group, year, lag_year [4]
#
# group year lag_year estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
# (fctr) (fctr) (fctr) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
# 1 A 2014-2015 2013-2014 -0.14570115 0.04597952 0.19168066 -0.6752803 0.5016009 74.05084 -0.5756153 0.2842130
# 2 A 2015-2016 2014-2015 -0.02752882 0.01845069 0.04597952 -0.1162621 0.9077438 77.96192 -0.4989302 0.4438726
# 3 B 2014-2015 2013-2014 0.39565472 0.05703318 -0.33862155 1.5776920 0.1187303 77.10933 -0.1037022 0.8950116
# 4 B 2015-2016 2014-2015 -0.07423089 -0.01719771 0.05703318 -0.3048113 0.7613240 77.77704 -0.5590850 0.4106233
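One thing worth noting about the t.test() call above: it does not pass var.equal = TRUE, so it runs Welch's test (the default), which is why the parameter column shows fractional degrees of freedom such as 74.05. The question used var.equal = TRUE, so the two approaches can give slightly different p-values. A small sketch of the difference, using arbitrary simulated data rather than the data from the answer:

```r
set.seed(15)  # arbitrary; simulated data for illustration only
x <- rnorm(40)
y <- rnorm(40)

welch  <- t.test(x, y)                    # default: var.equal = FALSE
pooled <- t.test(x, y, var.equal = TRUE)  # classic pooled-variance t-test

welch$parameter   # Welch-Satterthwaite df, generally not an integer
pooled$parameter  # pooled df: 40 + 40 - 2 = 78
c(welch = welch$p.value, pooled = pooled$p.value)
```

Adding var.equal = TRUE inside the t.test() call in the pipeline should bring it in line with the one-off test described in the question.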