Multiple paired t-tests on multiple variables simultaneously using dplyr/tidyverse

Tags: r, dplyr

Assume a data structure like this:

   ID testA_wave1 testA_wave2 testA_wave3 testB_wave1 testB_wave2 testB_wave3
1   1           3           2           3           6           5           3
2   2           4           4           4           3           6           6
3   3          10           2           1           4           4           4
4   4           5           3          12           2           7           4
5   5           5           3           9           2           4           2
6   6          10           0           2           6           6           5
7   7           6           8           4           6           8           3
8   8           1           5           4           5           6           0
9   9           3           2           7           8           4           4
10 10           4           9           5          11           8           8

What I want to achieve is to calculate a paired t-test for every test separately (here testA and testB, but in real life I have many more tests). Specifically, I want to compare the first wave of a given test with every subsequent wave of the same test (for testA, that means testA_wave1 vs testA_wave2 and testA_wave1 vs testA_wave3).

I was able to achieve it this way:

library(tidyverse)

df %>%
  # Wide to long: one row per ID and test/wave column
  gather(variable, value, -ID) %>%
  mutate(wave_ID = paste0("wave", parse_number(variable)),
         variable = ifelse(grepl("testA", variable), "testA",
                      ifelse(grepl("testB", variable), "testB", NA_character_))) %>%
  # Collect the values of each test/wave combination into a list-column
  group_by(wave_ID, variable) %>%
  summarise(value = list(value)) %>%
  spread(wave_ID, value) %>%
  group_by(variable) %>%
  # Compare wave 1 with each later wave
  mutate(p_value_w1w2 = t.test(unlist(wave1), unlist(wave2), paired = TRUE)$p.value,
         p_value_w1w3 = t.test(unlist(wave1), unlist(wave3), paired = TRUE)$p.value) %>%
  select(variable, matches("(p_value)"))

  variable p_value_w1w2 p_value_w1w3
  <chr>           <dbl>        <dbl>
1 testA           0.664        0.921
2 testB           0.146        0.418

However, I would like to see different/more elegant solutions that give similar results. I'm looking mostly for dplyr/tidyverse solutions, but if there is a completely different way to achieve it, I'm not against it.

Sample data:

set.seed(123)
df <- data.frame(
  ID = 1:20,
  testA_wave1 = round(rnorm(20, 5, 3), 0),
  testA_wave2 = round(rnorm(20, 5, 3), 0),
  testA_wave3 = round(rnorm(20, 5, 3), 0),
  testB_wave1 = round(rnorm(20, 5, 3), 0),
  testB_wave2 = round(rnorm(20, 5, 3), 0),
  testB_wave3 = round(rnorm(20, 5, 3), 0)
)
asked Mar 08 '19 18:03 by tmfmnk



1 Answer

Update 03/16/2022

The tidyverse has evolved and so should this solution.

First I make a simplifying assumption: if we designed the experiment, then we know what the tests are and how many waves we followed them through. If we don't know, we can extract this information from the column names; see the appendix below.

library("broom")
library("tidyverse")

tests <- c("A", "B")
waves <- 3

comparisons <-
  list(
    test = tests,
    first = 1,
    later = seq(2, waves)
  ) %>%
  cross_df()
comparisons
#> # A tibble: 4 × 3
#>   test  first later
#>   <chr> <dbl> <int>
#> 1 A         1     2
#> 2 B         1     2
#> 3 A         1     3
#> 4 B         1     3
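A note for readers on newer package versions: cross_df() was deprecated in purrr 1.0.0 in favour of tidyr::expand_grid(). The following is a minimal sketch of the same comparisons table built that way, not part of the original answer. The rows come out in a different order (all of test A first, then test B), which should not matter for the joins that follow.

```r
library(tidyr)

tests <- c("A", "B")
waves <- 3

# Every combination of test, the first wave, and each later wave.
# expand_grid() varies the last column fastest, so the row order
# differs from the cross_df() output shown above.
comparisons <- expand_grid(
  test = tests,
  first = 1,
  later = seq(2, waves)
)
comparisons
```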

Transform the data from wide format to long format.

data <- df %>%
  pivot_longer(
    -ID,
    names_to = "test_wave"
  ) %>%
  extract(
    test_wave, c("test", "wave"),
    regex = "test(.+)_wave(.+)",
    convert = TRUE
  )
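As a possible alternative, pivot_longer() can split the column names itself through its names_pattern argument, so the separate extract() step is not needed. This is only a sketch under the assumption that all columns follow the test<name>_wave<number> pattern; the name data_alt is made up for illustration, and the sample data from the question is repeated so the block runs on its own.

```r
library(tidyr)

# Sample data from the question
set.seed(123)
df <- data.frame(
  ID = 1:20,
  testA_wave1 = round(rnorm(20, 5, 3), 0),
  testA_wave2 = round(rnorm(20, 5, 3), 0),
  testA_wave3 = round(rnorm(20, 5, 3), 0),
  testB_wave1 = round(rnorm(20, 5, 3), 0),
  testB_wave2 = round(rnorm(20, 5, 3), 0),
  testB_wave3 = round(rnorm(20, 5, 3), 0)
)

# Reshape and split "testA_wave1" into test = "A", wave = 1 in one step
data_alt <- pivot_longer(
  df,
  cols = -ID,
  names_to = c("test", "wave"),
  names_pattern = "test(.+)_wave(.+)",
  names_transform = list(wave = as.integer)
)
head(data_alt)
```

The result has the columns ID, test, wave and value, the same layout the joins below expect.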

Then pair the comparisons we want to make with the data we collected. I've added lots of rename statements to make the code more readable, but they're not strictly necessary.

comparisons %>%
  inner_join(
    data,
    by = c("test", "first" = "wave")
  ) %>%
  rename(
    value.first = value
  ) %>%
  inner_join(
    data,
    by = c("test", "later" = "wave", "ID")
  ) %>%
  rename(
    value.later = value
  ) %>%
  group_by(
    test, first, later
  ) %>%
  group_modify(
    ~ tidy(t.test(.x$value.first, .x$value.later, paired = TRUE))
  ) %>%
  ungroup() %>%
  pivot_wider(
    id_cols = test,
    names_from = later,
    names_glue = "wave1_vs_wave{later}",
    values_from = p.value
  )
#> # A tibble: 2 × 3
#>   test  wave1_vs_wave2 wave1_vs_wave3
#>   <chr>          <dbl>          <dbl>
#> 1 A              0.664          0.921
#> 2 B              0.146          0.418
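To sanity-check a single cell of this table, one option is to run the test directly on the wide columns in base R. The sketch below (not part of the original answer) does this for test A, wave 1 vs wave 2, and also illustrates that a paired t-test is the same as a one-sample t-test on the within-ID differences. The names p_paired and p_diff are made up for illustration.

```r
# Sample data from the question
set.seed(123)
df <- data.frame(
  ID = 1:20,
  testA_wave1 = round(rnorm(20, 5, 3), 0),
  testA_wave2 = round(rnorm(20, 5, 3), 0),
  testA_wave3 = round(rnorm(20, 5, 3), 0),
  testB_wave1 = round(rnorm(20, 5, 3), 0),
  testB_wave2 = round(rnorm(20, 5, 3), 0),
  testB_wave3 = round(rnorm(20, 5, 3), 0)
)

# Paired t-test on two wide columns (rows are matched by position, i.e. by ID)
p_paired <- t.test(df$testA_wave1, df$testA_wave2, paired = TRUE)$p.value

# Equivalent one-sample t-test on the per-ID differences
p_diff <- t.test(df$testA_wave1 - df$testA_wave2)$p.value

all.equal(p_paired, p_diff)
```

The value of p_paired should correspond to the wave1_vs_wave2 entry for test A in the table above.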

Appendix: Extract test names and number of waves from column names.

design <- df %>%
  select(starts_with("test")) %>%
  colnames() %>%
  str_match("test(.+)_wave(.+)")
tests <- unique(design[, 2])
waves <- max(as.integer(design[, 3]))

Created on 2022-03-16 by the reprex package (v2.0.1)

Old solution

Here is one way to do it, using purrr quite a bit.

library("tidyverse")

set.seed(123)
df <- tibble(
  ID = 1:20,
  testA_wave1 = round(rnorm(20, 5, 3), 0),
  testA_wave2 = round(rnorm(20, 5, 3), 0),
  testA_wave3 = round(rnorm(20, 5, 3), 0),
  testB_wave1 = round(rnorm(20, 5, 3), 0),
  testB_wave2 = round(rnorm(20, 5, 3), 0),
  testB_wave3 = round(rnorm(20, 5, 3), 0)
)

pvalues <- df %>%
  # From wide tibble to long tibble
  gather(test, value, -ID) %>%
  separate(test, c("test", "wave")) %>%
  # Not strictly necessary; without it the waves are ordered alphabetically
  mutate(wave = parse_number(wave)) %>%
  inner_join(., ., by = c("ID", "test")) %>%
  # If there are two waves w1 and w2,
  # we end up with pairs (w1, w1), (w1, w2), (w2, w1) and (w2, w2),
  # so filter out to keep the pairing (w1, w2) only
  filter(wave.x == 1, wave.x < wave.y) %>%
  # tidyr >= 1.0.0 expects the nested columns to be named explicitly
  nest(data = c(ID, value.x, value.y)) %>%
  mutate(pvalue = data %>%
           # Perform the test
           map(~t.test(.$value.x, .$value.y, paired = TRUE)) %>%
           map(broom::tidy) %>%
           # Also not strictly necessary; you might want to keep all
           # information about the test: estimate, statistic, etc.
           map_dbl(pluck, "p.value"))
pvalues
#> # A tibble: 4 x 5
#>   test  wave.x wave.y data              pvalue
#>   <chr>  <dbl>  <dbl> <list>             <dbl>
#> 1 testA      1      2 <tibble [20 x 3]>  0.664
#> 2 testA      1      3 <tibble [20 x 3]>  0.921
#> 3 testB      1      2 <tibble [20 x 3]>  0.146
#> 4 testB      1      3 <tibble [20 x 3]>  0.418

pvalues %>%
  # Drop the data in order to pivot the table
  select(- data) %>%
  unite("waves", wave.x, wave.y, sep = ":") %>%
  spread(waves, pvalue)
#> # A tibble: 2 x 3
#>   test  `1:2` `1:3`
#>   <chr> <dbl> <dbl>
#> 1 testA 0.664 0.921
#> 2 testB 0.146 0.418

Created on 2019-03-08 by the reprex package (v0.2.1)

answered Oct 06 '22 09:10 by dipetkov