I have a data frame that looks like this:
a b
1 x 8
2 x 6
3 y 3
4 y 4
5 z 5
6 z 6
and I want to turn it into this:
x y z
1 8 3 5
2 6 4 6
But calling
library(tidyr)
df <- data.frame(
a = c("x", "x", "y", "y", "z", "z"),
b = c(8, 6, 3, 4, 5, 6)
)
df %>% spread(a, b)
returns
x y z
1 8 NA NA
2 6 NA NA
3 NA 3 NA
4 NA 4 NA
5 NA NA 5
6 NA NA 6
What am I doing wrong?
To use spread(), pass it the name of a data frame, then the name of the key column in the data frame, and then the name of the value column. Pass the column names as they are; do not use quotes. To tidy table2, you would pass spread() the key column and then the value column.
The spread() function from the tidyr package spreads a key-value pair across multiple columns, reshaping the data from long format to wide format. It is the inverse of gather().
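As a rough illustration of that inverse relationship, here is a minimal sketch. The data frame `wide` and its `id` column are made up for the example; the `id` column is what lets spread() know which values belong in the same output row.

```r
library(tidyr)

# Hypothetical wide data with an explicit row identifier
wide <- data.frame(id = 1:2, x = c(8, 6), y = c(3, 4), z = c(5, 6))

# Wide -> long: column names go into `a`, cell values into `b`
long <- gather(wide, key = "a", value = "b", x, y, z)

# Long -> wide: `id` identifies each output row, so no NA padding is needed
back <- spread(long, key = a, value = b)

print(long)
print(back)
```

Without a column like `id`, spread() has nothing to tell it that the first x value and the first y value belong on the same row, which is the situation in the question.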
To convert multiple columns into a single column in an R data frame, we can use the unlist function. For example, if a data frame df contains four columns, they can be combined into a single column with data.frame(x = unlist(df)).
tidyr is the Tidyverse package for getting data frames into tidy form. Recall that in a tidy data frame, each row is a unit of observation and each column is a single piece of information.
Notice that each column has text/string. We would like to collapse or combine two similar columns. For example, we would like to collapse ‘A’ and ‘T’ in columns s_A and s_B into ‘AT’. We have all the variables needed to create a dataframe. Now we can use data.frame function and make a new data frame df.
Each observation is a row. Each value is a cell. We will start with the data frame df and use the magrittr pipe (%>%) to pass it to tidyr's gather function, which converts the data from wide form to long form. Once the data is in long form, we can select certain columns and rename them for convenience.
Arguments of spread():
data: A data frame.
key, value: Column names or positions. This is passed to tidyselect::vars_pull(). These arguments are passed by expression and support quasiquotation (you can unquote column names or column positions).
fill: If set, missing values will be replaced with this value.
Development on spread() is complete, and for new code we recommend switching to pivot_wider(), which is easier to use, more featureful, and still under active development. df %>% spread(key, value) is equivalent to df %>% pivot_wider(names_from = key, values_from = value). See more details in vignette("pivot").
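Applied to the data frame from the question, a pivot_wider() version might look like the sketch below. It assumes dplyr is available, and the helper column `row` is a name chosen for this example: it numbers the values within each level of `a` so that every output row has a unique identifier.

```r
library(tidyr)
library(dplyr)

df <- data.frame(
  a = c("x", "x", "y", "y", "z", "z"),
  b = c(8, 6, 3, 4, 5, 6)
)

wide <- df %>%
  group_by(a) %>%
  mutate(row = row_number()) %>%   # 1, 2 within each level of a (hypothetical helper column)
  ungroup() %>%
  pivot_wider(names_from = a, values_from = b) %>%
  select(-row)                     # drop the helper once the reshape is done

print(wide)
```

The result is a tibble with columns x, y and z and two rows, which matches the shape asked for in the question.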
While I'm aware you're after tidyr, base has a solution in this case:
unstack(df, b~a)
It's also a little bit faster:
Unit: microseconds
expr min lq mean median uq max neval
df %>% spread(a, b) 657.699 679.508 717.7725 690.484 724.9795 1648.381 100
unstack(df, b ~ a) 309.891 335.264 349.4812 341.9635 351.6565 639.738 100
I haven't included the data.table solution as I'm not sure if pass by reference would be a problem for microbenchmark.
library(microbenchmark)
library(tidyr)
library(magrittr)
nlevels <- 3
#Ensure that all levels have the same number of elements
nrow <- 1e6 - 1e6 %% nlevels
df <- data.frame(a=sample(rep(c("x", "y", "z"), length.out=nrow)),
b=sample.int(9, nrow, replace=TRUE))
microbenchmark(df %>% spread(a, b), unstack(df, b ~ a), data.frame(split(df$b,df$a)), do.call(cbind,split(df$b,df$a)))
Even on 1 million rows, unstack is faster than spread. Notably, the split solutions are faster still.
Unit: milliseconds
expr min lq mean median uq max neval
df %>% spread(a, b) 366.24426 414.46913 450.78504 453.75258 486.1113 542.03722 100
unstack(df, b ~ a) 47.07663 51.17663 61.24411 53.05315 56.1114 102.71562 100
data.frame(split(df$b, df$a)) 19.44173 19.74379 22.28060 20.18726 22.1372 67.53844 100
do.call(cbind, split(df$b, df$a)) 26.99798 27.41594 31.27944 27.93225 31.2565 79.93624 100
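For reference, here is a small sketch of what the two base approaches return on the question's data, plus one caveat worth keeping in mind. The `uneven` data frame is made up for this example: when the groups have different sizes, unstack() returns a list rather than a data frame.

```r
df <- data.frame(
  a = c("x", "x", "y", "y", "z", "z"),
  b = c(8, 6, 3, 4, 5, 6)
)

# Both approaches give a data frame with columns x, y, z
wide_unstack <- unstack(df, b ~ a)
wide_split <- data.frame(split(df$b, df$a))
print(wide_unstack)
print(wide_split)

# Caveat: with unequal group sizes, unstack() returns a plain list
uneven <- df[-6, ]                 # hypothetical data: z now has only one value
res <- unstack(uneven, b ~ a)
print(is.data.frame(res))
```

Both base solutions rely on every level of a having the same number of values, which is why the benchmark setup above trims nrow to a multiple of nlevels.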
Somehow like this?
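library(dplyr)  # provides select()
# add a within-group index so that each output row has a unique identifier
df <- data.frame(ind = rep(1:min(table(df$a)), length(unique(df$a))), df)
df %>% spread(a, b) %>% select(-ind)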
x y z
1 8 3 5
2 6 4 6
Another base answer (that also looks fast):
data.frame(split(df$b,df$a))
You can do this with dcast and rowid from the data.table package as well:
dat <- dcast(setDT(df), rowid(a) ~ a, value.var = "b")[,a:=NULL]
which gives:
> dat
   x y z
1: 8 3 5
2: 6 4 6
Old solution:
# create a sequence number by group
setDT(df)[, r:=1:.N, by = a]
# reshape to wide format and remove the sequence variable
dat <- dcast(df, r ~ a, value.var = "b")[,r:=NULL]
which gives:
> dat
   x y z
1: 8 3 5
2: 6 4 6