Multiple paired t-tests on multiple variables simultaneously using dplyr/tidyverse

Tags:

r

dplyr

Assume a data structure like this:

   ID testA_wave1 testA_wave2 testA_wave3 testB_wave1 testB_wave2 testB_wave3
1   1           3           2           3           6           5           3
2   2           4           4           4           3           6           6
3   3          10           2           1           4           4           4
4   4           5           3          12           2           7           4
5   5           5           3           9           2           4           2
6   6          10           0           2           6           6           5
7   7           6           8           4           6           8           3
8   8           1           5           4           5           6           0
9   9           3           2           7           8           4           4
10 10           4           9           5          11           8           8

What I want to achieve is to calculate a paired t-test for every test separately (in this case testA and testB, but in real life I have many more tests). For each test, I want to compare the first wave with every subsequent wave of the same test (for testA, that means testA_wave1 vs testA_wave2 and testA_wave1 vs testA_wave3).

I was able to achieve it this way:

df %>%
 gather(variable, value, -ID) %>%
 mutate(wave_ID = paste0("wave", parse_number(variable)),
        variable = ifelse(grepl("testA", variable), "testA",
                     ifelse(grepl("testB", variable), "testB", NA_character_))) %>%
 group_by(wave_ID, variable) %>% 
 summarise(value = list(value)) %>% 
 spread(wave_ID, value) %>% 
 group_by(variable) %>% 
 mutate(p_value_w1w2 = t.test(unlist(wave1), unlist(wave2), paired = TRUE)$p.value,
        p_value_w1w3 = t.test(unlist(wave1), unlist(wave3), paired = TRUE)$p.value) %>%
 select(variable, matches("(p_value)"))

  variable p_value_w1w2 p_value_w1w3
  <chr>           <dbl>        <dbl>
1 testA           0.664        0.921
2 testB           0.146        0.418

However, I would like to see different/more elegant solutions that give similar results. I'm looking mostly for dplyr/tidyverse solutions, but if there is a completely different way to achieve it, I'm not against it.

Sample data:

set.seed(123)
df <- data.frame(ID = 1:20,
testA_wave1 = round(rnorm(20, 5, 3), 0),
testA_wave2 = round(rnorm(20, 5, 3), 0),
testA_wave3 = round(rnorm(20, 5, 3), 0),
testB_wave1 = round(rnorm(20, 5, 3), 0),
testB_wave2 = round(rnorm(20, 5, 3), 0),
testB_wave3 = round(rnorm(20, 5, 3), 0))
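
For reference, a single one of these comparisons can be sketched in base R. This is only an illustration of what "paired" means here, not part of the original question: a paired t-test on two waves is equivalent to a one-sample t-test on the per-ID differences, which is why the rows of the two waves must line up by ID. The object names `paired` and `on_diff` are made up for this sketch.

```r
# First two columns of the sample data (same seed, same draw order)
set.seed(123)
df <- data.frame(
  ID = 1:20,
  testA_wave1 = round(rnorm(20, 5, 3), 0),
  testA_wave2 = round(rnorm(20, 5, 3), 0)
)

# Paired t-test: wave 1 vs wave 2 of testA
paired <- t.test(df$testA_wave1, df$testA_wave2, paired = TRUE)

# Equivalent one-sample t-test on the within-ID differences
on_diff <- t.test(df$testA_wave1 - df$testA_wave2, mu = 0)

# The two p-values agree
all.equal(paired$p.value, on_diff$p.value)
```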

asked Mar 08 '19 18:03

tmfmnk

1 Answer

Update 03/16/2022

The tidyverse has evolved and so should this solution.

First I make a simplifying assumption: if we designed the experiment, then we know what the groups are and how many waves we followed them through. If we don't know, we can extract this information from the column names; see the appendix below.

library("broom")
library("tidyverse")

tests <- c("A", "B")
waves <- 3

comparisons <-
  list(
    test = tests,
    first = 1,
    later = seq(2, waves)
  ) %>%
  cross_df()
comparisons
#> # A tibble: 4 × 3
#>   test  first later
#>   <chr> <dbl> <int>
#> 1 A         1     2
#> 2 B         1     2
#> 3 A         1     3
#> 4 B         1     3
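
Note that `cross_df()` was deprecated in purrr 1.0.0 in favour of `tidyr::expand_grid()`. A minimal sketch of the same grid built that way, assuming a recent tidyr is installed:

```r
library(tidyr)

tests <- c("A", "B")
waves <- 3

# Every combination of test, first wave, and later wave
comparisons <- expand_grid(
  test = tests,
  first = 1,
  later = seq(2, waves)
)
comparisons
```

The rows come out in a different order than with `cross_df()` (here `test` varies slowest), which should not matter for the joins that follow.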

Transform the data from wide format to long format.

data <- df %>%
  pivot_longer(
    -ID,
    names_to = "test_wave"
  ) %>%
  extract(
    test_wave, c("test", "wave"),
    regex = "test(.+)_wave(.+)",
    convert = TRUE
  )
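
To make the shape of the long data concrete, here is a small sketch of the same reshaping on a made-up two-row data frame (`toy` and `long` are hypothetical names, and the values are just the first two rows of the table in the question). It avoids the pipe so that it only needs tidyr.

```r
library(tidyr)

# Toy wide data: two IDs, two tests, two waves
toy <- data.frame(
  ID = 1:2,
  testA_wave1 = c(3, 4),
  testA_wave2 = c(2, 4),
  testB_wave1 = c(6, 3),
  testB_wave2 = c(5, 6)
)

# One row per ID and column, with the column name in `test_wave`
long <- pivot_longer(toy, -ID, names_to = "test_wave")

# Split "testA_wave1" into test = "A" and wave = 1L
long <- tidyr::extract(
  long,
  test_wave, c("test", "wave"),
  regex = "test(.+)_wave(.+)",
  convert = TRUE
)
long
```

Each row of `long` holds one measurement, identified by `ID`, `test` and `wave`, which is what the joins in the next step rely on.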

Then pair the comparisons we want to make with the data we collected. I've added several rename statements to make the code more readable, but they're not strictly necessary.

comparisons %>%
  inner_join(
    data,
    by = c("test", "first" = "wave")
  ) %>%
  rename(
    value.first = value
  ) %>%
  inner_join(
    data,
    by = c("test", "later" = "wave", "ID")
  ) %>%
  rename(
    value.later = value
  ) %>%
  group_by(
    test, first, later
  ) %>%
  group_modify(
    ~ tidy(t.test(.x$value.first, .x$value.later, paired = TRUE))
  ) %>%
  ungroup() %>%
  pivot_wider(
    id_cols = test,
    names_from = later,
    names_glue = "wave1_vs_wave{later}",
    values_from = p.value
  )
#> # A tibble: 2 × 3
#>   test  wave1_vs_wave2 wave1_vs_wave3
#>   <chr>          <dbl>          <dbl>
#> 1 A              0.664          0.921
#> 2 B              0.146          0.418
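
Since the question is also open to non-tidyverse approaches, here is a rough base R sketch of the same comparisons. It is an alternative to the solution above, not a rewrite of it, and it assumes the columns follow the `test<name>_wave<number>` pattern and that the rows of `df` are already aligned by ID.

```r
set.seed(123)
df <- data.frame(
  ID = 1:20,
  testA_wave1 = round(rnorm(20, 5, 3), 0),
  testA_wave2 = round(rnorm(20, 5, 3), 0),
  testA_wave3 = round(rnorm(20, 5, 3), 0),
  testB_wave1 = round(rnorm(20, 5, 3), 0),
  testB_wave2 = round(rnorm(20, 5, 3), 0),
  testB_wave3 = round(rnorm(20, 5, 3), 0)
)

tests <- c("A", "B")
waves <- 3

# Outer loop over later waves, inner loop over tests;
# the result is a matrix with one row per test and one column per later wave
p_values <- sapply(seq(2, waves), function(w) {
  sapply(tests, function(test) {
    first <- df[[paste0("test", test, "_wave1")]]
    later <- df[[paste0("test", test, "_wave", w)]]
    t.test(first, later, paired = TRUE)$p.value
  })
})
colnames(p_values) <- paste0("wave1_vs_wave", seq(2, waves))
p_values
```

Because it makes the same `t.test()` calls on the same columns, this should give the same p-values as the table above, returned as a plain matrix rather than a tibble.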

Appendix: Extract test names and number of waves from column names.

design <- df %>%
  select(starts_with("test")) %>%
  colnames() %>%
  str_match("test(.+)_wave(.+)")
tests <- unique(design[, 2])
waves <- max(as.integer(design[, 3]))
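
To see what the appendix code works with, here is a short sketch on a hard-coded vector of column names (`cols` is a made-up name). `str_match()` returns a character matrix whose first column is the full match and whose remaining columns are the capture groups.

```r
library(stringr)

cols <- c(
  "testA_wave1", "testA_wave2", "testA_wave3",
  "testB_wave1", "testB_wave2", "testB_wave3"
)

# Column 1: full match, column 2: test name, column 3: wave number
design <- str_match(cols, "test(.+)_wave(.+)")
design

tests <- unique(design[, 2])
waves <- max(as.integer(design[, 3]))
```

Taking the maximum assumes that every test was measured at waves 1 through `waves` with no gaps; if some tests have fewer waves, the comparison grid would need to be built per test.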

Created on 2022-03-16 by the reprex package (v2.0.1)

Old solution

Here is one way to do it, using purrr quite a bit.

library("tidyverse")

set.seed(123)
df <- tibble(
  ID = 1:20,
  testA_wave1 = round(rnorm(20, 5, 3), 0),
  testA_wave2 = round(rnorm(20, 5, 3), 0),
  testA_wave3 = round(rnorm(20, 5, 3), 0),
  testB_wave1 = round(rnorm(20, 5, 3), 0),
  testB_wave2 = round(rnorm(20, 5, 3), 0),
  testB_wave3 = round(rnorm(20, 5, 3), 0)
)

pvalues <- df %>%
  # From wide tibble to long tibble
  gather(test, value, -ID) %>%
  separate(test, c("test", "wave")) %>%
  # Not strictly necessary; without this the waves are ordered alphabetically
  mutate(wave = parse_number(wave)) %>%
  inner_join(., ., by = c("ID", "test")) %>%
  # If there are two waves w1 and w2,
  # we end up with pairs (w1, w1), (w1, w2), (w2, w1) and (w2, w2),
  # so filter out to keep the pairing (w1, w2) only
  filter(wave.x == 1, wave.x < wave.y) %>%
  nest(ID, value.x, value.y) %>%
  mutate(pvalue = data %>%
           # Perform the test
           map(~t.test(.$value.x, .$value.y, paired = TRUE)) %>%
           map(broom::tidy) %>%
           # Also not strictly necessary; you might want to keep all
           # information about the test: estimate, statistic, etc.
           map_dbl(pluck, "p.value"))
pvalues
#> # A tibble: 4 x 5
#>   test  wave.x wave.y data              pvalue
#>   <chr>  <dbl>  <dbl> <list>             <dbl>
#> 1 testA      1      2 <tibble [20 x 3]>  0.664
#> 2 testA      1      3 <tibble [20 x 3]>  0.921
#> 3 testB      1      2 <tibble [20 x 3]>  0.146
#> 4 testB      1      3 <tibble [20 x 3]>  0.418

pvalues %>%
  # Drop the data in order to pivot the table
  select(- data) %>%
  unite("waves", wave.x, wave.y, sep = ":") %>%
  spread(waves, pvalue)
#> # A tibble: 2 x 3
#>   test  `1:2` `1:3`
#>   <chr> <dbl> <dbl>
#> 1 testA 0.664 0.921
#> 2 testB 0.146 0.418
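
One caveat when running the old solution today: since tidyr 1.0.0, the unnamed form `nest(ID, value.x, value.y)` is deprecated and emits a warning. A minimal sketch of the named form on a made-up data frame (`toy` and `nested` are hypothetical names):

```r
library(tidyr)

# Toy data: two tests, two IDs each
toy <- data.frame(
  test = c("A", "A", "B", "B"),
  ID = c(1, 2, 1, 2),
  value.x = c(3, 4, 6, 3),
  value.y = c(2, 4, 5, 6)
)

# Named form: the new list-column is called `data`,
# and the remaining column (test) defines the groups
nested <- nest(toy, data = c(ID, value.x, value.y))
nested
```

The resulting list-column is still called `data`, so the `mutate(pvalue = data %>% ...)` step would not need to change.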

Created on 2019-03-08 by the reprex package (v0.2.1)

answered Oct 06 '22 09:10

dipetkov
