I am trying to run an independent two-sample t-test on many columns of a data frame. For example, I created this data frame:
set.seed(333)
a <- rnorm(20, 10, 1)
b <- rnorm(20, 15, 2)
c <- rnorm(20, 20, 3)
grp <- rep(c('m', 'y'),10)
test_data <- data.frame(a, b, c, grp)
To run the test on each column, I used with(df, t.test(y ~ group)):
with(test_data, t.test(a ~ grp))
with(test_data, t.test(b ~ grp))
with(test_data, t.test(c ~ grp))
I would like the output to look like this:
mean in group m  mean in group y  p-value
       9.747412         9.878820   0.6944
       15.12936         16.49533  0.07798
       20.39531         20.20168   0.9027
How can I get this result using:
1. for loop
2. apply()
3. perhaps dplyr
The question "R: t-test over all columns" is related, but it is 6 years old, so there may be better ways to do this now.
Use select_if to select only the numeric columns, then use purrr::map_df to apply t.test to each of them against grp. Finally, use broom::tidy to get the results in tidy format:
library(tidyverse)
res <- test_data %>%
  select_if(is.numeric) %>%
  map_df(~ broom::tidy(t.test(. ~ grp)), .id = 'var')
res
#> # A tibble: 3 x 11
#>   var   estimate estimate1 estimate2 statistic p.value parameter conf.low
#>   <chr>    <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>
#> 1 a       -0.259      9.78      10.0    -0.587   0.565      16.2    -1.19
#> 2 b        0.154     15.0       14.8     0.169   0.868      15.4    -1.78
#> 3 c       -0.359     20.4       20.7    -0.287   0.778      16.5    -3.00
#> # ... with 3 more variables: conf.high <dbl>, method <chr>,
#> #   alternative <chr>
Created on 2019-03-15 by the reprex package (v0.2.1.9000)
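The tidy result carries more columns than the question asked for. A minimal sketch of trimming it down to the two group means and the p-value is below. It assumes dplyr 1.0.0 or later (for where(), which as far as I know is the recommended replacement for the superseded select_if) plus purrr and broom; the column labels mean_m and mean_y are my own choice, not part of any package.

```r
library(dplyr)
library(purrr)

# Same data as in the question
set.seed(333)
a <- rnorm(20, 10, 1)
b <- rnorm(20, 15, 2)
c <- rnorm(20, 20, 3)
grp <- rep(c('m', 'y'), 10)
test_data <- data.frame(a, b, c, grp)

res_small <- test_data %>%
  select(where(is.numeric)) %>%                      # keep a, b, c
  map_dfr(~ broom::tidy(t.test(.x ~ test_data$grp)), # one test per column
          .id = "var") %>%
  # mean_m / mean_y are illustrative names for estimate1 / estimate2
  select(var, mean_m = estimate1, mean_y = estimate2, p.value)

res_small
```

Here estimate1 and estimate2 are the means of the first and second group level (m and y), in the same order t.test prints them.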
Simply extract the estimate and p-value results from each t.test call while iterating through all needed columns with sapply. Build the formulas from a character vector and transpose the result with t() for output:
formulas <- paste(names(test_data)[1:(ncol(test_data)-1)], "~ grp")
output <- t(sapply(formulas, function(f) {
  res <- t.test(as.formula(f))
  c(res$estimate, p.value=res$p.value)
}))
Input data (seeded for reproducibility)
set.seed(333)
a <- rnorm(20, 10, 1)
b <- rnorm(20, 15, 2)
c <- rnorm(20, 20, 3)
grp <- rep(c('m', 'y'),10)
test_data <- data.frame(a, b, c, grp)
Output result
#         mean in group m mean in group y   p.value
# a ~ grp        9.775477        10.03419 0.5654353
# b ~ grp       14.972888        14.81895 0.8678149
# c ~ grp       20.383679        20.74238 0.7776188
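Since the question also asked about apply(), here is a rough base R sketch of the same idea with apply() over the columns. It assumes grp is the only non-numeric column, so the remaining columns can be coerced to a numeric matrix; num_cols and output_apply are names made up for this example.

```r
set.seed(333)
a <- rnorm(20, 10, 1)
b <- rnorm(20, 15, 2)
c <- rnorm(20, 20, 3)
grp <- rep(c('m', 'y'), 10)
test_data <- data.frame(a, b, c, grp)

# Logical index of the numeric columns (assumption: everything except grp)
num_cols <- sapply(test_data, is.numeric)

# MARGIN = 2 runs the function once per column of the numeric matrix
output_apply <- t(apply(as.matrix(test_data[, num_cols]), 2, function(col) {
  res <- t.test(col ~ test_data$grp)
  c(res$estimate, p.value = res$p.value)
}))
output_apply
```

The rows are labelled a, b and c rather than with the formula text, because apply() takes the names from the matrix columns.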
As you asked for a for loop:
a <- rnorm(20, 10, 1)
b <- rnorm(20, 15, 2)
c <- rnorm(20, 20, 3)
grp <- rep(c('m', 'y'),10)
test_data <- data.frame(a, b, c, grp)
meanM <- NULL
meanY <- NULL
p.value <- NULL
for (i in 1:(ncol(test_data) - 1)) {
  res <- t.test(test_data[, i] ~ grp)  # run the test once per column and reuse it
  meanM <- as.data.frame(rbind(meanM, res$estimate[1]))
  meanY <- as.data.frame(rbind(meanY, res$estimate[2]))
  p.value <- as.data.frame(rbind(p.value, res$p.value))
}
cbind(meanM, meanY, p.value)
It works, but I am a beginner in R, so there may be a more efficient solution.
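One possible way to tighten the loop is to preallocate a result matrix and fill one row per variable, instead of growing three data frames with rbind. This is only a sketch: the seed is borrowed from the question so the run is reproducible, and num_vars and out are names invented here.

```r
set.seed(333)  # assumption: same seed as the question, for reproducibility
a <- rnorm(20, 10, 1)
b <- rnorm(20, 15, 2)
c <- rnorm(20, 20, 3)
grp <- rep(c('m', 'y'), 10)
test_data <- data.frame(a, b, c, grp)

# Every column except the grouping variable
num_vars <- names(test_data)[names(test_data) != "grp"]

# Preallocate one row per variable and one column per reported quantity
out <- matrix(NA_real_, nrow = length(num_vars), ncol = 3,
              dimnames = list(num_vars,
                              c("mean in group m", "mean in group y", "p.value")))

for (v in num_vars) {
  res <- t.test(test_data[[v]] ~ test_data$grp)
  out[v, ] <- c(res$estimate, res$p.value)
}
out
```

With three columns the speed difference is negligible; the main gain is that the result keeps its row and column labels in one object.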
Using lapply, this is rather easy. I tested the code with set.seed(7060) before creating the dataset, in order to make the results reproducible.
tests_list <- lapply(letters[1:3], function(x) t.test(as.formula(paste0(x, "~ grp")), data = test_data))
result <- do.call(rbind, lapply(tests_list, `[[`, "estimate"))
pval <- sapply(tests_list, `[[`, "p.value")
result <- cbind(result, p.value = pval)
result
#     mean in group m mean in group y   p.value
#[1,]        9.909818        9.658813 0.6167742
#[2,]       14.578926       14.168816 0.6462151
#[3,]       20.682587       19.299133 0.2735725
Note that a real-life application would use names(test_data)[1:3], not letters[1:3], in the first lapply instruction.
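As a sketch of that note, the variable names can be taken from the data frame itself and reused as row labels. This assumes the grouping column is called grp, as in the question; vars is a name made up for the example, and reformulate() is used here in place of paste0() with as.formula().

```r
set.seed(7060)
a <- rnorm(20, 10, 1)
b <- rnorm(20, 15, 2)
c <- rnorm(20, 20, 3)
grp <- rep(c('m', 'y'), 10)
test_data <- data.frame(a, b, c, grp)

# All columns except the grouping variable
vars <- setdiff(names(test_data), "grp")

# reformulate("grp", response = "a") builds the formula a ~ grp
tests_list <- lapply(vars, function(x) {
  t.test(reformulate("grp", response = x), data = test_data)
})

result <- cbind(
  do.call(rbind, lapply(tests_list, `[[`, "estimate")),
  p.value = vapply(tests_list, `[[`, FUN.VALUE = numeric(1), "p.value")
)
rownames(result) <- vars  # label rows by variable instead of [1,], [2,], [3,]
result
```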