Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spread multiple columns in a function

Tags:

r

dplyr

tidyr

rlang

Often I need to spread multiple value columns, as in this question. But I do it often enough that I'd like to be able to write a function that does this.

For example, given the data:

set.seed(42)
dat <- data_frame(id = rep(1:2,each = 2),
                  grp = rep(letters[1:2],times = 2),
                  avg = rnorm(4),
                  sd = runif(4))
> dat
# A tibble: 4 x 4
     id   grp        avg        sd
  <int> <chr>      <dbl>     <dbl>
1     1     a  1.3709584 0.6569923
2     1     b -0.5646982 0.7050648
3     2     a  0.3631284 0.4577418
4     2     b  0.6328626 0.7191123

I'd like to create a function that returns something like:

# A tibble: 2 x 5
     id     a_avg      b_avg      a_sd      b_sd
  <int>     <dbl>      <dbl>     <dbl>     <dbl>
1     1 1.3709584 -0.5646982 0.6569923 0.7050648
2     2 0.3631284  0.6328626 0.4577418 0.7191123

How can I do that?

like image 703
joran Avatar asked Dec 13 '22 21:12

joran


1 Answers

We'll return to the answer provided in the question linked to, but for the moment let's start with a more naive approach.

One idea would be to spread each value column individually, and then join the results, i.e.

library(dplyr)
library(tidyr)
library(tibble)

dat_avg <- dat %>% 
    select(-sd) %>%
    spread(key = grp,value = avg) %>%
    rename(a_avg = a,
           b_avg = b)

dat_sd <- dat %>% 
    select(-avg) %>%
    spread(key = grp,value = sd) %>%
    rename(a_sd = a,
           b_sd = b)

> full_join(dat_avg,
          dat_sd,
          by = 'id')

# A tibble: 2 x 5
     id     a_avg      b_avg      a_sd      b_sd
  <int>     <dbl>      <dbl>     <dbl>     <dbl>
1     1 1.3709584 -0.5646982 0.6569923 0.7050648
2     2 0.3631284  0.6328626 0.4577418 0.7191123

(I used a full_join just in case we run into situations where not all combinations of the join columns appear in all of them.)

Let's start with a function that works like spread but allows you to pass the key and value columns as characters:

spread_chr <- function(data, key_col, value_cols, fill = NA, 
                       convert = FALSE,drop = TRUE,sep = NULL){
    n_val <- length(value_cols)
    result <- vector(mode = "list", length = n_val)
    id_cols <- setdiff(names(data), c(key_col,value_cols))

    for (i in seq_along(result)){
        result[[i]] <- spread(data = data[,c(id_cols,key_col,value_cols[i]),drop = FALSE],
                              key = !!key_col,
                              value = !!value_cols[i],
                              fill = fill,
                              convert = convert,
                              drop = drop,
                              sep = paste0(sep,value_cols[i],sep))
    }

    result %>%
        purrr::reduce(.f = full_join, by = id_cols)
}

> dat %>%
  spread_chr(key_col = "grp",
             value_cols = c("avg","sd"),
             sep = "_")

# A tibble: 2 x 5
     id grp_avg_a  grp_avg_b  grp_sd_a  grp_sd_b
  <int>     <dbl>      <dbl>     <dbl>     <dbl>
1     1 1.3709584 -0.5646982 0.6569923 0.7050648
2     2 0.3631284  0.6328626 0.4577418 0.7191123

The key ideas here are to unquote the arguments key_col and value_cols[i] using the !! operator, and using the sep argument in spread to control the resulting value column names.

If we wanted to convert this function to accept unquoted arguments for the key and value columns, we could modify it like so:

spread_nq <- function(data, key_col,..., fill = NA, 
                      convert = FALSE, drop = TRUE, sep = NULL){
    val_quos <- rlang::quos(...)
    key_quo <- rlang::enquo(key_col)
    value_cols <- unname(tidyselect::vars_select(names(data),!!!val_quos))
    key_col <- unname(tidyselect::vars_select(names(data),!!key_quo))

    n_val <- length(value_cols)
    result <- vector(mode = "list",length = n_val)
    id_cols <- setdiff(names(data),c(key_col,value_cols))

    for (i in seq_along(result)){
        result[[i]] <- spread(data = data[,c(id_cols,key_col,value_cols[i]),drop = FALSE],
                              key = !!key_col,
                              value = !!value_cols[i],
                              fill = fill,
                              convert = convert,
                              drop = drop,
                              sep = paste0(sep,value_cols[i],sep))
    }

    result %>%
        purrr::reduce(.f = full_join,by = id_cols)
}

> dat %>%
  spread_nq(key_col = grp,avg,sd,sep = "_")

# A tibble: 2 x 5
     id grp_avg_a  grp_avg_b  grp_sd_a  grp_sd_b
  <int>     <dbl>      <dbl>     <dbl>     <dbl>
1     1 1.3709584 -0.5646982 0.6569923 0.7050648
2     2 0.3631284  0.6328626 0.4577418 0.7191123

The change here is that we capture the unquoted arguments with rlang::quos and rlang::enquo and then simply convert them back to characters using tidyselect::vars_select.

Returning to the solution in the linked question that uses a sequence of gather, unite and spread, we can use what we've learned to make a function like this:

spread_nt <- function(data,key_col,...,fill = NA,
                      convert = TRUE,drop = TRUE,sep = "_"){
  key_quo <- rlang::enquo(key_col)
  val_quos <- rlang::quos(...)
  value_cols <- unname(tidyselect::vars_select(names(data),!!!val_quos))
  key_col <- unname(tidyselect::vars_select(names(data),!!key_quo))

  data %>%
    gather(key = ..var..,value = ..val..,!!!val_quos) %>%
    unite(col = ..grp..,c(key_col,"..var.."),sep = sep) %>%
    spread(key = ..grp..,value = ..val..,fill = fill,
           convert = convert,drop = drop,sep = NULL)
}

> dat %>%
  spread_nt(key_col = grp,avg,sd,sep = "_")

# A tibble: 2 x 5
     id     a_avg      a_sd      b_avg      b_sd
* <int>     <dbl>     <dbl>      <dbl>     <dbl>
1     1 1.3709584 0.6569923 -0.5646982 0.7050648
2     2 0.3631284 0.4577418  0.6328626 0.7191123

This relies on the same techniques from rlang from the last example. We're using some unusual names like ..var.. for our intermediate variables in order to reduce the chances of name collisions with existing columns in our data frame.

Also, we're using the sep argument in unite to control the resulting column names, so in this case when we spread we force sep = NULL.

like image 90
joran Avatar answered Dec 31 '22 08:12

joran