
Spreading a two column data frame with tidyr

Tags: r, dplyr, tidyr

I have a data frame that looks like this:

  a b
1 x 8
2 x 6
3 y 3
4 y 4
5 z 5
6 z 6

and I want to turn it into this:

  x y z
1 8 3 5
2 6 4 6

But calling

library(tidyr)
df <- data.frame(
    a = c("x", "x", "y", "y", "z", "z"),
    b = c(8, 6, 3, 4, 5, 6)
)
df %>% spread(a, b)

returns

   x  y  z
1  8 NA NA
2  6 NA NA
3 NA  3 NA
4 NA  4 NA
5 NA NA  5
6 NA NA  6

What am I doing wrong?

ljos asked Nov 07 '15 16:11




4 Answers

While I'm aware you're after tidyr, base has a solution in this case:

unstack(df, b~a)

It's also about twice as fast on this small example:

Unit: microseconds

                expr     min      lq     mean  median       uq      max neval
 df %>% spread(a, b) 657.699 679.508 717.7725 690.484 724.9795 1648.381   100
  unstack(df, b ~ a) 309.891 335.264 349.4812 341.9635 351.6565 639.738   100

By popular demand, with something bigger

I haven't included the data.table solution as I'm not sure if pass by reference would be a problem for microbenchmark.

library(microbenchmark)
library(tidyr)
library(magrittr)

nlevels <- 3
#Ensure that all levels have the same number of elements
nrow <- 1e6 - 1e6 %% nlevels
df <- data.frame(a=sample(rep(c("x", "y", "z"), length.out=nrow)),
                 b=sample.int(9, nrow, replace=TRUE))

microbenchmark(
    df %>% spread(a, b),
    unstack(df, b ~ a),
    data.frame(split(df$b, df$a)),
    do.call(cbind, split(df$b, df$a))
)

Even with 1 million rows, unstack is faster than spread. Notably, the split solutions are faster still.

Unit: milliseconds
                              expr       min        lq      mean    median       uq       max neval
               df %>% spread(a, b) 366.24426 414.46913 450.78504 453.75258 486.1113 542.03722   100
                unstack(df, b ~ a)  47.07663  51.17663  61.24411  53.05315  56.1114 102.71562   100
     data.frame(split(df$b, df$a))  19.44173  19.74379  22.28060  20.18726  22.1372  67.53844   100
 do.call(cbind, split(df$b, df$a))  26.99798  27.41594  31.27944  27.93225  31.2565  79.93624   100
sebastian-c answered Oct 21 '22 12:10
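
As a quick check of the two base R approaches benchmarked above, here is a minimal sketch that runs them on the question's six-row data frame. The names wide_unstack and wide_split are made up for illustration; everything else is base R.

```r
# Example data from the question
df <- data.frame(
    a = c("x", "x", "y", "y", "z", "z"),
    b = c(8, 6, 3, 4, 5, 6)
)

# unstack() splits b by the values of a; because every group has the
# same length here, the result is coerced to a data frame
wide_unstack <- unstack(df, b ~ a)

# split() returns a named list of vectors, one per value of a,
# and data.frame() turns each list element into a column
wide_split <- data.frame(split(df$b, df$a))

print(wide_unstack)
print(wide_split)
```

Both calls rely on every group having the same number of rows. When group sizes differ, unstack() returns a plain list instead of a data frame, so this shortcut only fits data shaped like the example.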


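Something like this? spread() needs a column that identifies which output row each value belongs to, so add an index first:

library(dplyr)  # for select()
# ind numbers the rows within each group (assumes equal-sized groups, sorted by a)
df <- data.frame(ind = rep(1:min(table(df$a)), length(unique(df$a))), df)
df %>% spread(a, b) %>% select(-ind)
  x y z
1 8 3 5
2 6 4 6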
DatamineR answered Oct 21 '22 12:10
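
In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). The same index-column idea can be sketched with the newer function as follows. This is one possible version rather than the answerer's code: it assumes dplyr and tidyr (>= 1.0.0) are installed, and the column name ind and the result name wide are arbitrary choices.

```r
library(dplyr)
library(tidyr)

# Example data from the question
df <- data.frame(
    a = c("x", "x", "y", "y", "z", "z"),
    b = c(8, 6, 3, 4, 5, 6)
)

wide <- df %>%
    group_by(a) %>%
    mutate(ind = row_number()) %>%   # position of each value within its group
    ungroup() %>%
    pivot_wider(names_from = a, values_from = b) %>%  # ind identifies the output rows
    select(-ind)                     # drop the helper column

print(wide)
```

Numbering rows within each group with row_number() does not depend on the data being sorted by a, and groups of unequal size simply produce NA in the shorter columns.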


Another base answer (which also appears to be fast):

data.frame(split(df$b,df$a))
nicola answered Oct 21 '22 12:10


You can do this with dcast and rowid from the data.table package as well:

library(data.table)
dat <- dcast(setDT(df), rowid(a) ~ a, value.var = "b")[, a := NULL]

which gives:

> dat
   x y z
1: 8 3 5
2: 6 4 6

Old solution:

# create a sequence number by group
setDT(df)[, r:=1:.N, by = a]
# reshape to wide format and remove the sequence variable
dat <- dcast(df, r ~ a, value.var = "b")[,r:=NULL]

which gives:

> dat
   x y z
1: 8 3 5
2: 6 4 6
Jaap answered Oct 21 '22 10:10