I have data from an online survey where respondents go through a loop of questions 1-3 times. The survey software (Qualtrics) records this data in multiple columns—that is, Q3.2 in the survey will have columns <code>Q3.2.1.</code>, <code>Q3.2.2.</code>, and <code>Q3.2.3.</code>: <pre class="prettyprint"><code>df <- data.frame( id = 1:10, time = as.Date('2009-01-01') + 0:9, Q3.2.1. = rnorm(10, 0, 1), Q3.2.2. = rnorm(10, 0, 1), Q3.2.3. = rnorm(10, 0, 1), Q3.3.1. = rnorm(10, 0, 1), Q3.3.2. = rnorm(10, 0, 1), Q3.3.3. = rnorm(10, 0, 1) ) # Sample data id time Q3.2.1. Q3.2.2. Q3.2.3. Q3.3.1. Q3.3.2. Q3.3.3. 1 1 2009-01-01 -0.2059165 -0.29177677 -0.7107192 1.52718069 -0.4484351 -1.21550600 2 2 2009-01-02 -0.1981136 -1.19813815 1.1750200 -0.40380049 -1.8376094 1.03588482 3 3 2009-01-03 0.3514795 -0.27425539 1.1171712 -1.02641801 -2.0646661 -0.35353058 ... </code></pre> I want to combine all the QN.N* columns into tidy individual QN.N columns, ultimately ending up with something like this: <pre class="prettyprint"><code> id time loop_number Q3.2 Q3.3 1 1 2009-01-01 1 -0.20591649 1.52718069 2 2 2009-01-02 1 -0.19811357 -0.40380049 3 3 2009-01-03 1 0.35147949 -1.02641801 ... 11 1 2009-01-01 2 -0.29177677 -0.4484351 12 2 2009-01-02 2 -1.19813815 -1.8376094 13 3 2009-01-03 2 -0.27425539 -2.0646661 ... 21 1 2009-01-01 3 -0.71071921 -1.21550600 22 2 2009-01-02 3 1.17501999 1.03588482 23 3 2009-01-03 3 1.11717121 -0.35353058 ... </code></pre> The <code>tidyr</code> library has the <code>gather()</code> function, which works great for combining one set of columns: <pre class="prettyprint"><code>library(dplyr) library(tidyr) library(stringr) df %>% gather(loop_number, Q3.2, starts_with("Q3.2")) %>% mutate(loop_number = str_sub(loop_number,-2,-2)) %>% select(id, time, loop_number, Q3.2) id time loop_number Q3.2 1 1 2009-01-01 1 -0.20591649 2 2 2009-01-02 1 -0.19811357 3 3 2009-01-03 1 0.35147949 ... 29 9 2009-01-09 3 -0.58581232 30 10 2009-01-10 3 -2.33393981 </code></pre> The resultant data frame has 30 rows, as expected (10 individuals, 3 loops each). However, gathering a second set of columns does not work correctly—it successfully makes the two combined columns <code>Q3.2</code> and <code>Q3.3</code>, but ends up with 90 rows instead of 30 (all combinations of 10 individuals, 3 loops of Q3.2, and 3 loops of Q3.3; the combinations will increase substantially for each group of columns in the actual data): <pre class="prettyprint"><code>df %>% gather(loop_number, Q3.2, starts_with("Q3.2")) %>% gather(loop_number, Q3.3, starts_with("Q3.3")) %>% mutate(loop_number = str_sub(loop_number,-2,-2)) id time loop_number Q3.2 Q3.3 1 1 2009-01-01 1 -0.20591649 1.52718069 2 2 2009-01-02 1 -0.19811357 -0.40380049 3 3 2009-01-03 1 0.35147949 -1.02641801 ... 89 9 2009-01-09 3 -0.58581232 -0.13187024 90 10 2009-01-10 3 -2.33393981 -0.48502131 </code></pre> Is there a way to use multiple calls to <code>gather()</code> like this, combining small subsets of columns like this while maintaining the correct number of rows?

This approach seems pretty natural to me: <pre class="prettyprint"><code>df %>% gather(key, value, -id, -time) %>% extract(key, c("question", "loop_number"), "(Q.\\..)\\.(.)") %>% spread(question, value) </code></pre> First gather all question columns, use <code>extract()</code> to separate into <code>question</code> and <code>loop_number</code>, then <code>spread()</code> question back into the columns. <pre class="prettyprint"><code>#> id time loop_number Q3.2 Q3.3 #> 1 1 2009-01-01 1 0.142259203 -0.35842736 #> 2 1 2009-01-01 2 0.061034802 0.79354061 #> 3 1 2009-01-01 3 -0.525686204 -0.67456611 #> 4 2 2009-01-02 1 -1.044461185 -1.19662936 #> 5 2 2009-01-02 2 0.393808163 0.42384717 </code></pre>

In case you are like me, and cannot work out how to use "regular expression with capturing groups" for <code>extract</code>, the following code replicates the <code>extract(...)</code> line in Hadleys' answer: <pre class="prettyprint"><code>df %>% gather(question_number, value, starts_with("Q3.")) %>% mutate(loop_number = str_sub(question_number,-2,-2), question_number = str_sub(question_number,1,4)) %>% select(id, time, loop_number, question_number, value) %>% spread(key = question_number, value = value) </code></pre> The problem here is that the initial gather forms a key column that is actually a combination of two keys. I chose to use <code>mutate</code> in my original solution in the comments to split this column into two columns with equivalent info, a <code>loop_number</code> column and a <code>question_number</code> column. <code>spread</code> can then be used to transform the long form data, which are key value pairs <code>(question_number, value)</code> to wide form data.

Gather multiple sets of columns

Tags:

r

dplyr

reshape

qualtrics

tidyr

I have data from an online survey where respondents go through a loop of questions 1-3 times. The survey software (Qualtrics) records this data in multiple columns—that is, Q3.2 in the survey will have columns Q3.2.1., Q3.2.2., and Q3.2.3.:

df <- data.frame(
  id = 1:10,
  time = as.Date('2009-01-01') + 0:9,
  Q3.2.1. = rnorm(10, 0, 1),
  Q3.2.2. = rnorm(10, 0, 1),
  Q3.2.3. = rnorm(10, 0, 1),
  Q3.3.1. = rnorm(10, 0, 1),
  Q3.3.2. = rnorm(10, 0, 1),
  Q3.3.3. = rnorm(10, 0, 1)
)

# Sample data

   id       time    Q3.2.1.     Q3.2.2.    Q3.2.3.     Q3.3.1.    Q3.3.2.     Q3.3.3.
1   1 2009-01-01 -0.2059165 -0.29177677 -0.7107192  1.52718069 -0.4484351 -1.21550600
2   2 2009-01-02 -0.1981136 -1.19813815  1.1750200 -0.40380049 -1.8376094  1.03588482
3   3 2009-01-03  0.3514795 -0.27425539  1.1171712 -1.02641801 -2.0646661 -0.35353058
...

I want to combine all the QN.N* columns into tidy individual QN.N columns, ultimately ending up with something like this:

   id       time loop_number        Q3.2        Q3.3
1   1 2009-01-01           1 -0.20591649  1.52718069
2   2 2009-01-02           1 -0.19811357 -0.40380049
3   3 2009-01-03           1  0.35147949 -1.02641801
...
11  1 2009-01-01           2 -0.29177677  -0.4484351
12  2 2009-01-02           2 -1.19813815  -1.8376094
13  3 2009-01-03           2 -0.27425539  -2.0646661
...
21  1 2009-01-01           3 -0.71071921 -1.21550600
22  2 2009-01-02           3  1.17501999  1.03588482
23  3 2009-01-03           3  1.11717121 -0.35353058
...

The tidyr library has the gather() function, which works great for combining one set of columns:

library(dplyr)
library(tidyr)
library(stringr)

df %>% gather(loop_number, Q3.2, starts_with("Q3.2")) %>% 
  mutate(loop_number = str_sub(loop_number,-2,-2)) %>%
  select(id, time, loop_number, Q3.2)


   id       time loop_number        Q3.2
1   1 2009-01-01           1 -0.20591649
2   2 2009-01-02           1 -0.19811357
3   3 2009-01-03           1  0.35147949
...
29  9 2009-01-09           3 -0.58581232
30 10 2009-01-10           3 -2.33393981

The resultant data frame has 30 rows, as expected (10 individuals, 3 loops each). However, gathering a second set of columns does not work correctly—it successfully makes the two combined columns Q3.2 and Q3.3, but ends up with 90 rows instead of 30 (all combinations of 10 individuals, 3 loops of Q3.2, and 3 loops of Q3.3; the combinations will increase substantially for each group of columns in the actual data):

df %>% gather(loop_number, Q3.2, starts_with("Q3.2")) %>% 
  gather(loop_number, Q3.3, starts_with("Q3.3")) %>%
  mutate(loop_number = str_sub(loop_number,-2,-2))


   id       time loop_number        Q3.2        Q3.3
1   1 2009-01-01           1 -0.20591649  1.52718069
2   2 2009-01-02           1 -0.19811357 -0.40380049
3   3 2009-01-03           1  0.35147949 -1.02641801
...
89  9 2009-01-09           3 -0.58581232 -0.13187024
90 10 2009-01-10           3 -2.33393981 -0.48502131

Is there a way to use multiple calls to gather() like this, combining small subsets of columns like this while maintaining the correct number of rows?

944

asked Sep 19 '14 02:09

Andrew

5 Answers

This approach seems pretty natural to me:

df %>%
  gather(key, value, -id, -time) %>%
  extract(key, c("question", "loop_number"), "(Q.\\..)\\.(.)") %>%
  spread(question, value)

First gather all question columns, use extract() to separate into question and loop_number, then spread() question back into the columns.

#>    id       time loop_number         Q3.2        Q3.3
#> 1   1 2009-01-01           1  0.142259203 -0.35842736
#> 2   1 2009-01-01           2  0.061034802  0.79354061
#> 3   1 2009-01-01           3 -0.525686204 -0.67456611
#> 4   2 2009-01-02           1 -1.044461185 -1.19662936
#> 5   2 2009-01-02           2  0.393808163  0.42384717

187

answered Oct 06 '22 03:10

hadley

This could be done using reshape. It is possible with dplyr though.

  colnames(df) <- gsub("\\.(.{2})$", "_\\1", colnames(df))
  colnames(df)[2] <- "Date"
  res <- reshape(df, idvar=c("id", "Date"), varying=3:8, direction="long", sep="_")
  row.names(res) <- 1:nrow(res)
  
   head(res)
  #  id       Date time       Q3.2       Q3.3
  #1  1 2009-01-01    1  1.3709584  0.4554501
  #2  2 2009-01-02    1 -0.5646982  0.7048373
  #3  3 2009-01-03    1  0.3631284  1.0351035
  #4  4 2009-01-04    1  0.6328626 -0.6089264
  #5  5 2009-01-05    1  0.4042683  0.5049551
  #6  6 2009-01-06    1 -0.1061245 -1.7170087

Or using dplyr

  library(tidyr)
  library(dplyr)
  colnames(df) <- gsub("\\.(.{2})$", "_\\1", colnames(df))

  df %>%
     gather(loop_number, "Q3", starts_with("Q3")) %>% 
     separate(loop_number,c("L1", "L2"), sep="_") %>% 
     spread(L1, Q3) %>%
     select(-L2) %>%
     head()
  #  id       time       Q3.2       Q3.3
  #1  1 2009-01-01  1.3709584  0.4554501
  #2  1 2009-01-01  1.3048697  0.2059986
  #3  1 2009-01-01 -0.3066386  0.3219253
  #4  2 2009-01-02 -0.5646982  0.7048373
  #5  2 2009-01-02  2.2866454 -0.3610573
  #6  2 2009-01-02 -1.7813084 -0.7838389

Update

With new version of tidyr, we can use pivot_longer to reshape multiple columns. (Using the changed column names from gsub above)

library(dplyr)
library(tidyr)
df %>% 
    pivot_longer(cols = starts_with("Q3"), 
          names_to = c(".value", "Q3"), names_sep = "_") %>% 
    select(-Q3)
# A tibble: 30 x 4
#      id time         Q3.2    Q3.3
#   <int> <date>      <dbl>   <dbl>
# 1     1 2009-01-01  0.974  1.47  
# 2     1 2009-01-01 -0.849 -0.513 
# 3     1 2009-01-01  0.894  0.0442
# 4     2 2009-01-02  2.04  -0.553 
# 5     2 2009-01-02  0.694  0.0972
# 6     2 2009-01-02 -1.11   1.85  
# 7     3 2009-01-03  0.413  0.733 
# 8     3 2009-01-03 -0.896 -0.271 
#9     3 2009-01-03  0.509 -0.0512
#10     4 2009-01-04  1.81   0.668 
# … with 20 more rows

NOTE: Values are different because there was no set seed in creating the input dataset

answered Oct 06 '22 02:10

akrun

With the recent update to melt.data.table, we can now melt multiple columns. With that, we can do:

require(data.table) ## 1.9.5
melt(setDT(df), id=1:2, measure=patterns("^Q3.2", "^Q3.3"), 
     value.name=c("Q3.2", "Q3.3"), variable.name="loop_number")
 #    id       time loop_number         Q3.2        Q3.3
 # 1:  1 2009-01-01           1 -0.433978480  0.41227209
 # 2:  2 2009-01-02           1 -0.567995351  0.30701144
 # 3:  3 2009-01-03           1 -0.092041353 -0.96024077
 # 4:  4 2009-01-04           1  1.137433487  0.60603396
 # 5:  5 2009-01-05           1 -1.071498263 -0.01655584
 # 6:  6 2009-01-06           1 -0.048376809  0.55889996
 # 7:  7 2009-01-07           1 -0.007312176  0.69872938

You can get the development version from here.

answered Oct 06 '22 02:10

Arun

It's not at all related to "tidyr" and "dplyr", but here's another option to consider: merged.stack from my "splitstackshape" package, V1.4.0 and above.

library(splitstackshape)
merged.stack(df, id.vars = c("id", "time"), 
             var.stubs = c("Q3.2.", "Q3.3."),
             sep = "var.stubs")
#     id       time .time_1       Q3.2.       Q3.3.
#  1:  1 2009-01-01      1. -0.62645381  1.35867955
#  2:  1 2009-01-01      2.  1.51178117 -0.16452360
#  3:  1 2009-01-01      3.  0.91897737  0.39810588
#  4:  2 2009-01-02      1.  0.18364332 -0.10278773
#  5:  2 2009-01-02      2.  0.38984324 -0.25336168
#  6:  2 2009-01-02      3.  0.78213630 -0.61202639
#  7:  3 2009-01-03      1. -0.83562861  0.38767161
# <<:::SNIP:::>>
# 24:  8 2009-01-08      3. -1.47075238 -1.04413463
# 25:  9 2009-01-09      1.  0.57578135  1.10002537
# 26:  9 2009-01-09      2.  0.82122120 -0.11234621
# 27:  9 2009-01-09      3. -0.47815006  0.56971963
# 28: 10 2009-01-10      1. -0.30538839  0.76317575
# 29: 10 2009-01-10      2.  0.59390132  0.88110773
# 30: 10 2009-01-10      3.  0.41794156 -0.13505460
#     id       time .time_1       Q3.2.       Q3.3.

answered Oct 06 '22 01:10

A5C1D2H2I1M1N2O1R2T1

In case you are like me, and cannot work out how to use "regular expression with capturing groups" for extract, the following code replicates the extract(...) line in Hadleys' answer:

df %>% 
    gather(question_number, value, starts_with("Q3.")) %>%
    mutate(loop_number = str_sub(question_number,-2,-2), question_number = str_sub(question_number,1,4)) %>%
    select(id, time, loop_number, question_number, value) %>% 
    spread(key = question_number, value = value)

The problem here is that the initial gather forms a key column that is actually a combination of two keys. I chose to use mutate in my original solution in the comments to split this column into two columns with equivalent info, a loop_number column and a question_number column. spread can then be used to transform the long form data, which are key value pairs (question_number, value) to wide form data.