Spreading a two column data frame with tidyr

Tags:

r

dplyr

tidyr

I have a data frame that looks like this:

  a b
1 x 8
2 x 6
3 y 3
4 y 4
5 z 5
6 z 6

and I want to turn it into this:

  x y z
1 8 3 5
2 6 4 6

But calling

library(tidyr)
df <- data.frame(
    a = c("x", "x", "y", "y", "z", "z"),
    b = c(8, 6, 3, 4, 5, 6)
)
df %>% spread(a, b)

returns

   x  y  z
1  8 NA NA
2  6 NA NA
3 NA  3 NA
4 NA  4 NA
5 NA NA  5
6 NA NA  6

What am I doing wrong?

asked Nov 07 '15 16:11

ljos

4 Answers

While I'm aware you're after tidyr, base has a solution in this case:

unstack(df, b~a)

It's also a little bit faster:

Unit: microseconds
                expr     min      lq     mean  median       uq      max neval
 df %>% spread(a, b) 657.699 679.508 717.7725 690.484 724.9795 1648.381   100
  unstack(df, b ~ a) 309.891 335.264 349.4812 341.9635 351.6565 639.738   100

By popular demand, with something bigger

I haven't included the data.table solution because I'm not sure whether its modification by reference would be a problem for microbenchmark.

library(microbenchmark)
library(tidyr)
library(magrittr)

nlevels <- 3
#Ensure that all levels have the same number of elements
nrow <- 1e6 - 1e6 %% nlevels
df <- data.frame(a=sample(rep(c("x", "y", "z"), length.out=nrow)),
                 b=sample.int(9, nrow, replace=TRUE))

microbenchmark(df %>% spread(a, b),
               unstack(df, b ~ a),
               data.frame(split(df$b, df$a)),
               do.call(cbind, split(df$b, df$a)))

Even on about 1 million rows, unstack is faster than spread (median of roughly 53 ms versus 454 ms). Notably, the two split solutions are faster still.

Unit: milliseconds
                              expr       min        lq      mean    median       uq       max neval
               df %>% spread(a, b) 366.24426 414.46913 450.78504 453.75258 486.1113 542.03722   100
                unstack(df, b ~ a)  47.07663  51.17663  61.24411  53.05315  56.1114 102.71562   100
     data.frame(split(df$b, df$a))  19.44173  19.74379  22.28060  20.18726  22.1372  67.53844   100
 do.call(cbind, split(df$b, df$a))  26.99798  27.41594  31.27944  27.93225  31.2565  79.93624   100
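
Timings only mean something if the candidates return the same values. Here is a minimal sketch of such a check on the small data frame from the question; the names by_unstack, by_split, by_cbind and same_values are made up for illustration. Note that the do.call(cbind, ...) variant returns a matrix rather than a data frame.

```r
df <- data.frame(
    a = c("x", "x", "y", "y", "z", "z"),
    b = c(8, 6, 3, 4, 5, 6)
)

by_unstack <- unstack(df, b ~ a)                  # data frame
by_split   <- data.frame(split(df$b, df$a))       # data frame
by_cbind   <- do.call(cbind, split(df$b, df$a))   # matrix, not a data frame

# Compare the values only, ignoring the difference in container type
same_values <- all(as.matrix(by_unstack) == as.matrix(by_split)) &&
    all(as.matrix(by_split) == by_cbind)
same_values
```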

answered Oct 21 '22 12:10

sebastian-c

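Something like this? spread() has no column telling it which values belong on the same output row, so it cannot pair them up. Adding an index within each group fixes that: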

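library(dplyr)  # for select()
# index the rows within each group (assumes rows are ordered by group and every group has the same size)
df <- data.frame(ind = rep(1:min(table(df$a)), length(unique(df$a))), df)
df %>% spread(a, b) %>% select(-ind)
  x y z
1 8 3 5
2 6 4 6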
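
The same idea can be sketched with a per-group row number, so the rows do not have to be sorted by group or the groups equal in size. This version assumes a tidyr release that provides pivot_wider (1.0.0 or later) as the replacement for spread. The name wide is made up for illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
    a = c("x", "x", "y", "y", "z", "z"),
    b = c(8, 6, 3, 4, 5, 6)
)

wide <- df %>%
    group_by(a) %>%
    mutate(ind = row_number()) %>%   # 1, 2, ... within each level of a
    ungroup() %>%
    pivot_wider(names_from = a, values_from = b) %>%
    select(-ind)                     # drop the helper index again
wide
```

If the groups have different sizes, the shorter columns are filled with NA.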

answered Oct 21 '22 12:10

DatamineR

Another base answer (which also looks fast):

data.frame(split(df$b,df$a))
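
One caveat: this relies on every group having the same number of rows, as in the question. For unequal groups, a possible workaround (a sketch, not part of the original answer) is to pad each piece with NA up to the longest group before building the data frame. The names df2, parts, n and wide are made up for illustration.

```r
# Hypothetical example with unequal group sizes: x has 3 rows, y has 2, z has 1
df2 <- data.frame(
    a = c("x", "x", "x", "y", "y", "z"),
    b = c(8, 6, 7, 3, 4, 5)
)

parts <- split(df2$b, df2$a)
n <- max(lengths(parts))

# `length<-` extends a vector to length n, padding with NA
wide <- data.frame(lapply(parts, `length<-`, n))
wide
```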

answered Oct 21 '22 12:10

nicola

You can do this with dcast and rowid from the data.table package as well:

library(data.table)
dat <- dcast(setDT(df), rowid(a) ~ a, value.var = "b")[, a := NULL]

which gives:

> dat
   x y z
1: 8 3 5
2: 6 4 6

Old solution:

# create a sequence number by group
setDT(df)[, r:=1:.N, by = a]
# reshape to wide format and remove the sequence variable
dat <- dcast(df, r ~ a, value.var = "b")[,r:=NULL]

which gives:

> dat
   x y z
1: 8 3 5
2: 6 4 6
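
Both versions call setDT, which converts df to a data.table in place. If the original data frame should stay untouched (for example when the same df is reused in a benchmark), one option is to work on a copy made with as.data.table. This is a sketch, and the names dt and wide are made up for illustration.

```r
library(data.table)

df <- data.frame(
    a = c("x", "x", "y", "y", "z", "z"),
    b = c(8, 6, 3, 4, 5, 6)
)

dt <- as.data.table(df)   # a copy; df itself stays a plain data.frame
dt[, r := rowid(a)]       # sequence number within each level of a (modifies dt only)

# reshape to wide format and remove the sequence variable
wide <- dcast(dt, r ~ a, value.var = "b")[, r := NULL]
wide
```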

answered Oct 21 '22 10:10

Jaap
