Often I need to spread
multiple value columns, as in this question. But I do it often enough that I'd like to be able to write a function that does this.
For example, given the data:
set.seed(42)
dat <- data_frame(id = rep(1:2,each = 2),
grp = rep(letters[1:2],times = 2),
avg = rnorm(4),
sd = runif(4))
> dat
# A tibble: 4 x 4
id grp avg sd
<int> <chr> <dbl> <dbl>
1 1 a 1.3709584 0.6569923
2 1 b -0.5646982 0.7050648
3 2 a 0.3631284 0.4577418
4 2 b 0.6328626 0.7191123
I'd like to create a function that returns something like:
# A tibble: 2 x 5
id a_avg b_avg a_sd b_sd
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1.3709584 -0.5646982 0.6569923 0.7050648
2 2 0.3631284 0.6328626 0.4577418 0.7191123
How can I do that?
We'll return to the answer provided in the question linked to, but for the moment let's start with a more naive approach.
One idea would be to spread
each value column individually, and then join the results, i.e.
library(dplyr)
library(tidyr)
library(tibble)
dat_avg <- dat %>%
select(-sd) %>%
spread(key = grp,value = avg) %>%
rename(a_avg = a,
b_avg = b)
dat_sd <- dat %>%
select(-avg) %>%
spread(key = grp,value = sd) %>%
rename(a_sd = a,
b_sd = b)
> full_join(dat_avg,
dat_sd,
by = 'id')
# A tibble: 2 x 5
id a_avg b_avg a_sd b_sd
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1.3709584 -0.5646982 0.6569923 0.7050648
2 2 0.3631284 0.6328626 0.4577418 0.7191123
(I used a full_join
just in case we run into situations where not all combinations of the join columns appear in all of them.)
Let's start with a function that works like spread
but allows you to pass the key
and value
columns as characters:
spread_chr <- function(data, key_col, value_cols, fill = NA,
convert = FALSE,drop = TRUE,sep = NULL){
n_val <- length(value_cols)
result <- vector(mode = "list", length = n_val)
id_cols <- setdiff(names(data), c(key_col,value_cols))
for (i in seq_along(result)){
result[[i]] <- spread(data = data[,c(id_cols,key_col,value_cols[i]),drop = FALSE],
key = !!key_col,
value = !!value_cols[i],
fill = fill,
convert = convert,
drop = drop,
sep = paste0(sep,value_cols[i],sep))
}
result %>%
purrr::reduce(.f = full_join, by = id_cols)
}
> dat %>%
spread_chr(key_col = "grp",
value_cols = c("avg","sd"),
sep = "_")
# A tibble: 2 x 5
id grp_avg_a grp_avg_b grp_sd_a grp_sd_b
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1.3709584 -0.5646982 0.6569923 0.7050648
2 2 0.3631284 0.6328626 0.4577418 0.7191123
The key ideas here are to unquote the arguments key_col
and value_cols[i]
using the !!
operator, and using the sep
argument in spread
to control the resulting value column names.
If we wanted to convert this function to accept unquoted arguments for the key and value columns, we could modify it like so:
spread_nq <- function(data, key_col,..., fill = NA,
convert = FALSE, drop = TRUE, sep = NULL){
val_quos <- rlang::quos(...)
key_quo <- rlang::enquo(key_col)
value_cols <- unname(tidyselect::vars_select(names(data),!!!val_quos))
key_col <- unname(tidyselect::vars_select(names(data),!!key_quo))
n_val <- length(value_cols)
result <- vector(mode = "list",length = n_val)
id_cols <- setdiff(names(data),c(key_col,value_cols))
for (i in seq_along(result)){
result[[i]] <- spread(data = data[,c(id_cols,key_col,value_cols[i]),drop = FALSE],
key = !!key_col,
value = !!value_cols[i],
fill = fill,
convert = convert,
drop = drop,
sep = paste0(sep,value_cols[i],sep))
}
result %>%
purrr::reduce(.f = full_join,by = id_cols)
}
> dat %>%
spread_nq(key_col = grp,avg,sd,sep = "_")
# A tibble: 2 x 5
id grp_avg_a grp_avg_b grp_sd_a grp_sd_b
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1.3709584 -0.5646982 0.6569923 0.7050648
2 2 0.3631284 0.6328626 0.4577418 0.7191123
The change here is that we capture the unquoted arguments with rlang::quos
and rlang::enquo
and then simply convert them back to characters using tidyselect::vars_select
.
Returning to the solution in the linked question that uses a sequence of gather
, unite
and spread
, we can use what we've learned to make a function like this:
spread_nt <- function(data,key_col,...,fill = NA,
convert = TRUE,drop = TRUE,sep = "_"){
key_quo <- rlang::enquo(key_col)
val_quos <- rlang::quos(...)
value_cols <- unname(tidyselect::vars_select(names(data),!!!val_quos))
key_col <- unname(tidyselect::vars_select(names(data),!!key_quo))
data %>%
gather(key = ..var..,value = ..val..,!!!val_quos) %>%
unite(col = ..grp..,c(key_col,"..var.."),sep = sep) %>%
spread(key = ..grp..,value = ..val..,fill = fill,
convert = convert,drop = drop,sep = NULL)
}
> dat %>%
spread_nt(key_col = grp,avg,sd,sep = "_")
# A tibble: 2 x 5
id a_avg a_sd b_avg b_sd
* <int> <dbl> <dbl> <dbl> <dbl>
1 1 1.3709584 0.6569923 -0.5646982 0.7050648
2 2 0.3631284 0.4577418 0.6328626 0.7191123
This relies on the same techniques from rlang from the last example. We're using some unusual names like ..var..
for our intermediate variables in order to reduce the chances of name collisions with existing columns in our data frame.
Also, we're using the sep
argument in unite
to control the resulting column names, so in this case when we spread
we force sep = NULL
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With