I'm trying to take columns that are in long format and spread them to wide format as shown below. I'd like to use tidyr to solve this with the data manipulation tools I'm investing in but to make this answer more general please provide other solutions. Here's what I have: <pre class="prettyprint"><code>library(dplyr); library(tidyr) set.seed(10) dat <- data_frame( Person = rep(c("greg", "sally", "sue"), each=2), Time = rep(c("Pre", "Post"), 3), Score1 = round(rnorm(6, mean = 80, sd=4), 0), Score2 = round(jitter(Score1, 15), 0), Score3 = 5 + (Score1 + Score2)/2 ) ## Person Time Score1 Score2 Score3 ## 1 greg Pre 80 78 84.0 ## 2 greg Post 79 80 84.5 ## 3 sally Pre 75 74 79.5 ## 4 sally Post 78 78 83.0 ## 5 sue Pre 81 78 84.5 ## 6 sue Post 82 81 86.5 </code></pre> Desired wide format: <pre class="prettyprint"><code> Person Pre.Score1 Pre.Score2 Pre.Score3 Post.Score1 Post.Score2 Post.Score3 1 greg 80 78 84.0 79 80 84.5 2 sally 75 74 79.5 78 78 83.0 3 sue 81 78 84.5 82 81 86.5 </code></pre> I can do it by doing something like this for each score: <pre class="prettyprint"><code>spread(dat %>% select(Person, Time, Score1), Time, Score1) %>% rename(Score1_Pre = Pre, Score1_Post = Post) </code></pre> And then using <code>_join</code> but that seems verbose and like there's got to be a better way. Related questions: tidyr wide to long with two repeated measures Is it possible to use spread on multiple columns in tidyr similar to dcast?

Using <code>reshape2</code>: <pre class="prettyprint"><code>library(reshape2) dcast(melt(dat), Person ~ Time + variable) </code></pre> Produces: <pre class="prettyprint"><code>Using Person, Time as id variables Person Post_Score1 Post_Score2 Post_Score3 Pre_Score1 Pre_Score2 Pre_Score3 1 greg 79 78 83.5 83 81 87.0 2 sally 82 81 86.5 75 74 79.5 3 sue 78 78 83.0 82 79 85.5 </code></pre>

Using <code>dcast</code> from the <code>data.table</code> package. <pre class="prettyprint"><code>library(data.table)#v1.9.5+ dcast(setDT(dat), Person~Time, value.var=paste0("Score", 1:3)) # Person Score1_Post Score1_Pre Score2_Post Score2_Pre Score3_Post Score3_Pre #1: greg 79 80 80 78 84.5 84.0 #2: sally 78 75 78 74 83.0 79.5 #3: sue 82 81 81 78 86.5 84.5 </code></pre> Or <code>reshape</code> from <code>baseR</code> <pre class="prettyprint"><code>reshape(as.data.frame(dat), idvar='Person', timevar='Time',direction='wide') </code></pre> <hr> <h3>Update</h3> From development version <code>tidyr_0.8.3.9000</code> or CRAN release <code>tidyr_1.0.0</code>, we can use <code>pivot_wider</code> for multiple value columns <pre class="prettyprint"><code>library(tidyr) library(stringr) dat %>% pivot_wider(names_from = Time, values_from = str_c("Score", 1:3)) # A tibble: 3 x 7 # Person Score1_Pre Score1_Post Score2_Pre Score2_Post Score3_Pre Score3_Post # <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #1 greg 80 79 78 80 84 84.5 #2 sally 75 78 74 78 79.5 83 #3 sue 81 82 78 81 84.5 86.5 </code></pre>

I did a benchmark for myself and post it here in case someone is interested: <h3>Code</h3> The setup is chosen from the OP, three variables, two time points. However, the size of the data frames is varied from 1,000 to 100,000 rows. <pre class="prettyprint lang-r prettyprint-override"><code>library(magrittr) library(data.table) library(bench) f1 <- function(dat) { tidyr::gather(dat, key = "key", value = "value", -Person, -Time) %>% tidyr::unite("id", Time, key, sep = ".") %>% tidyr::spread(id, value) } f2 <- function(dat) { reshape2::dcast(melt(dat, id.vars = c("Person", "Time")), Person ~ Time + variable) } f3 <- function(dat) { dcast(melt(dat, id.vars = c("Person", "Time")), Person ~ Time + variable) } create_df <- function(rows) { dat <- expand.grid(Person = factor(1:ceiling(rows/2)), Time = c("1Pre", "2Post")) dat$Score1 <- round(rnorm(nrow(dat), mean = 80, sd = 4), 0) dat$Score2 <- round(jitter(dat$Score1, 15), 0) dat$Score3 <- 5 + (dat$Score1 + dat$Score2)/2 return(dat) } </code></pre> <h3>Results</h3> As you can see, reshape2 is a little bit faster than tidyr, probably because tidyr has a larger overhead. Importantly, data.table excels with > 10,000 rows. <pre class="prettyprint lang-r prettyprint-override"><code>press( rows = 10^(3:5), { dat <- create_df(rows) dat2 <- copy(dat) setDT(dat2) bench::mark(tidyr = f1(dat), reshape2 = f2(dat), datatable = f3(dat2), check = function(x, y) all.equal(x, y, check.attributes = FALSE), min_iterations = 20 ) } ) #> Warning: Some expressions had a GC in every iteration; so filtering is #> disabled. #> # A tibble: 9 x 11 #> expression rows min mean median max `itr/sec` mem_alloc #> <chr> <dbl> <bch:tm> <bch:tm> <bch:tm> <bch:tm> <dbl> <bch:byt> #> 1 tidyr 1000 5.7ms 6.13ms 6.02ms 10.06ms 163. 2.78MB #> 2 reshape2 1000 2.82ms 3.09ms 2.97ms 8.67ms 323. 1.7MB #> 3 datatable 1000 3.82ms 4ms 3.92ms 8.06ms 250. 2.78MB #> 4 tidyr 10000 19.31ms 20.34ms 19.95ms 22.98ms 49.2 8.24MB #> 5 reshape2 10000 13.81ms 14.4ms 14.4ms 15.6ms 69.4 11.34MB #> 6 datatable 10000 14.56ms 15.16ms 14.91ms 18.93ms 66.0 2.98MB #> 7 tidyr 100000 197.24ms 219.69ms 205.27ms 268.92ms 4.55 90.55MB #> 8 reshape2 100000 164.02ms 195.32ms 176.31ms 284.77ms 5.12 121.69MB #> 9 datatable 100000 51.31ms 60.34ms 58.36ms 113.69ms 16.6 27.36MB #> # ... with 3 more variables: n_gc <dbl>, n_itr <int>, total_time <bch:tm> </code></pre> Created on 2019-02-27 by the reprex package (v0.2.1)

How can I spread repeated measures of multiple variables into wide format?

Tags:

r

tidyr

I'm trying to take columns that are in long format and spread them to wide format as shown below. I'd like to use tidyr to solve this with the data manipulation tools I'm investing in but to make this answer more general please provide other solutions.

Here's what I have:

library(dplyr); library(tidyr)

set.seed(10)
dat <- data_frame(
    Person = rep(c("greg", "sally", "sue"), each=2),
    Time = rep(c("Pre", "Post"), 3),
    Score1 = round(rnorm(6, mean = 80, sd=4), 0),
    Score2 = round(jitter(Score1, 15), 0),
    Score3 = 5 + (Score1 + Score2)/2
)

##   Person Time Score1 Score2 Score3
## 1   greg  Pre     80     78   84.0
## 2   greg Post     79     80   84.5
## 3  sally  Pre     75     74   79.5
## 4  sally Post     78     78   83.0
## 5    sue  Pre     81     78   84.5
## 6    sue Post     82     81   86.5

Desired wide format:

  Person Pre.Score1 Pre.Score2 Pre.Score3  Post.Score1 Post.Score2 Post.Score3
1   greg         80         78       84.0           79          80        84.5
2  sally         75         74       79.5           78          78        83.0
3    sue         81         78       84.5           82          81        86.5

I can do it by doing something like this for each score:

spread(dat %>% select(Person, Time, Score1), Time, Score1) %>% 
    rename(Score1_Pre = Pre, Score1_Post = Post)

And then using _join but that seems verbose and like there's got to be a better way.

Related questions:
tidyr wide to long with two repeated measures
Is it possible to use spread on multiple columns in tidyr similar to dcast?

977

asked Apr 21 '15 14:04

Tyler Rinker

4 Answers

Edit: I'm updating this answer since pivot_wider has been around for a while now and addresses the issue in this question and comments. You can now do

pivot_wider(
    dat, 
    id_cols = 'Person', 
    names_from = 'Time', 
    values_from = c('Score1', 'Score2', 'Score3'), 
    names_glue = '{Time}.{.value}'
)

to get the desired result.

The original answer was

dat %>% 
  gather(temp, score, starts_with("Score")) %>% 
  unite(temp1, Time, temp, sep = ".") %>% 
  spread(temp1, score)

134

answered Oct 19 '22 12:10

konvas

Using reshape2:

library(reshape2)
dcast(melt(dat), Person ~ Time + variable)

Produces:

Using Person, Time as id variables
  Person Post_Score1 Post_Score2 Post_Score3 Pre_Score1 Pre_Score2 Pre_Score3
1   greg          79          78        83.5         83         81       87.0
2  sally          82          81        86.5         75         74       79.5
3    sue          78          78        83.0         82         79       85.5

answered Oct 19 '22 12:10

BrodieG

Using dcast from the data.table package.

library(data.table)#v1.9.5+
dcast(setDT(dat), Person~Time, value.var=paste0("Score", 1:3))
#     Person Score1_Post Score1_Pre Score2_Post Score2_Pre Score3_Post Score3_Pre
#1:   greg          79         80          80         78        84.5       84.0
#2:  sally          78         75          78         74        83.0       79.5
#3:    sue          82         81          81         78        86.5       84.5

Or reshape from baseR

reshape(as.data.frame(dat), idvar='Person', timevar='Time',direction='wide')

Update

From development version tidyr_0.8.3.9000 or CRAN release tidyr_1.0.0, we can use pivot_wider for multiple value columns

library(tidyr)
library(stringr)
dat %>%
     pivot_wider(names_from = Time, values_from = str_c("Score", 1:3))
# A tibble: 3 x 7
#  Person Score1_Pre Score1_Post Score2_Pre Score2_Post Score3_Pre Score3_Post
#   <chr>       <dbl>       <dbl>      <dbl>       <dbl>      <dbl>       <dbl>
#1 greg           80          79         78          80       84          84.5
#2 sally          75          78         74          78       79.5        83  
#3 sue            81          82         78          81       84.5        86.5

answered Oct 19 '22 11:10

akrun

I did a benchmark for myself and post it here in case someone is interested:

Code

The setup is chosen from the OP, three variables, two time points. However, the size of the data frames is varied from 1,000 to 100,000 rows.

library(magrittr)
library(data.table)
library(bench)

f1 <- function(dat) {
    tidyr::gather(dat, key = "key", value = "value", -Person, -Time) %>% 
        tidyr::unite("id", Time, key, sep = ".") %>%
        tidyr::spread(id, value)
}

f2 <- function(dat) {
    reshape2::dcast(melt(dat, id.vars = c("Person", "Time")), Person ~ Time + variable)
}

f3 <- function(dat) {
    dcast(melt(dat, id.vars = c("Person", "Time")), Person ~ Time + variable)
}

create_df <- function(rows) {
    dat <- expand.grid(Person = factor(1:ceiling(rows/2)),
                       Time = c("1Pre", "2Post"))
    dat$Score1 <- round(rnorm(nrow(dat), mean = 80, sd = 4), 0)
    dat$Score2 <- round(jitter(dat$Score1, 15), 0)
    dat$Score3 <- 5 + (dat$Score1 + dat$Score2)/2
    return(dat)
}

Results

As you can see, reshape2 is a little bit faster than tidyr, probably because tidyr has a larger overhead. Importantly, data.table excels with > 10,000 rows.

press(
    rows = 10^(3:5),
    {
        dat <- create_df(rows)
        dat2 <- copy(dat)
        setDT(dat2)
        bench::mark(tidyr     = f1(dat),
                    reshape2  = f2(dat),
                    datatable = f3(dat2),
                    check = function(x, y) all.equal(x, y, check.attributes = FALSE),
                    min_iterations = 20
        )
    }
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 9 x 11
#>   expression   rows      min     mean   median      max `itr/sec` mem_alloc
#>   <chr>       <dbl> <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 tidyr        1000    5.7ms   6.13ms   6.02ms  10.06ms    163.      2.78MB
#> 2 reshape2     1000   2.82ms   3.09ms   2.97ms   8.67ms    323.       1.7MB
#> 3 datatable    1000   3.82ms      4ms   3.92ms   8.06ms    250.      2.78MB
#> 4 tidyr       10000  19.31ms  20.34ms  19.95ms  22.98ms     49.2     8.24MB
#> 5 reshape2    10000  13.81ms   14.4ms   14.4ms   15.6ms     69.4    11.34MB
#> 6 datatable   10000  14.56ms  15.16ms  14.91ms  18.93ms     66.0     2.98MB
#> 7 tidyr      100000 197.24ms 219.69ms 205.27ms 268.92ms      4.55   90.55MB
#> 8 reshape2   100000 164.02ms 195.32ms 176.31ms 284.77ms      5.12  121.69MB
#> 9 datatable  100000  51.31ms  60.34ms  58.36ms 113.69ms     16.6    27.36MB
#> # ... with 3 more variables: n_gc <dbl>, n_itr <int>, total_time <bch:tm>

^{Created on 2019-02-27 by the reprex package (v0.2.1)}

answered Oct 19 '22 10:10

hplieninger

Related questions
                            
                                List of ggplot2 theme options?
                            
                                The R %in% operator
                            
                                Global variables in packages in R
                            
                                Insert a character at a specific location in a string
                            
                                Subsetting R data frame results in mysterious NA rows
                            
                                Change bar plot colour in geom_bar with ggplot2 in r
                            
                                Common main title of a figure panel compiled with par(mfrow)
                            
                                Controlling the order of points in ggplot2?
                            
                                Generate list of all possible combinations of elements of vector
                            
                                Read a CSV from github into R
                            
                                Formatting dates on X axis in ggplot2
                            
                                How to organize large Shiny apps?
                            
                                Unseen factor levels when appending new records with unseen string values to a dataframe, cause Warning and result in NA
                            
                                Custom legend for multiple layer ggplot
                            
                                How to specify "does not contain" in dplyr filter
                            
                                Forcing garbage collection to run in R with the gc() command
                            
                                ggplot2, facet_grid, free scales?
                            
                                How can I check whether a function call results in a warning?
                            
                                Calculate row means on subset of columns
                            
                                Access variable value where the name of variable is stored in a string

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With