Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

dplyr: access column name in mutate_at function

Tags:

r

dplyr

I would like to correct a column in a data.frame by subtracting from it another column with nearly identical name, but this other column has a suffix. I would like to use the mutate_at function for this.

Trying to figure this out, I have struggled to access the name of a column in the function part of mutate_at, to the use it to access the other column.

I show this in a small example below, but basically I would like to access the name of the column used at the moment . and then select from the data in the pipe a column that has the same name as . but with a suffix (below that would be "_new").

Thanks for your help!

Here is an example of how I would have liked to do it - but this does not work.

library(tidyverse)
data("mtcars")
new <- mtcars/4
names(new) <-paste0(names(new),"_new")

df <- bind_cols(mtcars,new)

df %>% 
  mutate_at(.vars = vars(carb,disp),
            .funs = list(corrected = ~ . - df %>% pull(paste0(names(.),"_new"))))

df %>% pull(paste0("carb","_new"))

like image 262
Moritz Schwarz Avatar asked May 20 '20 00:05

Moritz Schwarz


People also ask

How do I get the name of a specific column in R?

To select a column in R you can use brackets e.g., YourDataFrame['Column'] will take the column named “Column”. Furthermore, we can also use dplyr and the select() function to get columns by name or index. For instance, select(YourDataFrame, c('A', 'B') will take the columns named “A” and “B” from the dataframe.

What Dplyr function do you use to pick observations by their values?

6.4 dplyr basics filter() : pick observations by their values. select() : pick variables by their names. mutate() : create new variables with functions of existing variables. summarise() : collapse many values down to a single summary.


5 Answers

Instead of using mutate_at why not use mutate combined with across and cur_column i.e:

df %>% 
  mutate( across( c(carb,disp), ~ . - pull(df, paste0(cur_column(), "_new") ),  .names = "{.col}_corrected") )
like image 101
Abdessabour Mtk Avatar answered Sep 23 '22 23:09

Abdessabour Mtk


As the others already pointed out, variable names cannot be accessed in mutate_at and this holds also true for the coming mutate(across()). I addressed this issue as feature request for dplyr here, but apparently, this kind of data-wrangling task is too specialized for dplyr. Below I provide my favorite work-around for this kind of data-wrangling problems, which consists of two steps:

  1. define a custom mutate function using !! rlang::sym() to generate variable based on a character vector of variable names
  2. apply this custom function using purrr::reduce.
library(tidyverse)

# your toy data
df <- mtcars %>% 
         as_tibble %>% 
         mutate_all(list(new =~ ./4))

# step 1: generate helper function, in this case a simple `mutate` call

gen_corrected <- function(df, x) {

  mutate(df,
         "{x}_corrected" := !! rlang::sym(x) - !! rlang::sym(str_c(x, "_new"))
  )
}

# step 2:
# use purrr's `reduce` on the vector of vars you want to change
# the vector of variables can be defined in a separate step
# important: you need to set `.init = .`

df %>% 
  purrr::reduce(c('carb', 'disp'), gen_corrected, .init = .)
#> # A tibble: 32 x 24
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb mpg_new
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4    5.25
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4    5.25
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1    5.7 
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1    5.35
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2    4.68
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1    4.53
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4    3.58
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2    6.1 
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2    5.7 
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4    4.8 
#> # … with 22 more rows, and 12 more variables: cyl_new <dbl>, disp_new <dbl>,
#> #   hp_new <dbl>, drat_new <dbl>, wt_new <dbl>, qsec_new <dbl>, vs_new <dbl>,
#> #   am_new <dbl>, gear_new <dbl>, carb_new <dbl>, carb_corrected <dbl>,
#> #   disp_corrected <dbl>

Created on 2020-05-21 by the reprex package (v0.3.0)

In the github issue mention above @Romain Francois provides another work-around to this issue.

like image 36
TimTeaFan Avatar answered Sep 25 '22 23:09

TimTeaFan


We can't access the names of the . inside mutate_at because it is a vector and doesn't have the column name info. One option is map2

library(purrr)
library(dplyr)
library(stringr)
nm1 <- c('carb', 'disp')
map_dfc(nm1, ~ df %>%
                transmute(!!str_c(.x, '_corrected') := 
                    !! rlang::sym(.x) - !! rlang::sym(str_c(.x, "_new"))))%>%
   bind_cols(df, .) %>%
   head
# mpg cyl disp  hp drat    wt  qsec vs am gear carb mpg_new cyl_new disp_new hp_new drat_new  wt_new qsec_new
#1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4   5.250     1.5    40.00  27.50   0.9750 0.65500   4.1150
#2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4   5.250     1.5    40.00  27.50   0.9750 0.71875   4.2550
#3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1   5.700     1.0    27.00  23.25   0.9625 0.58000   4.6525
#4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1   5.350     1.5    64.50  27.50   0.7700 0.80375   4.8600
#5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2   4.675     2.0    90.00  43.75   0.7875 0.86000   4.2550
#6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1   4.525     1.5    56.25  26.25   0.6900 0.86500   5.0550
#  vs_new am_new gear_new carb_new carb_corrected disp_corrected
#1   0.00   0.25     1.00     1.00           3.00         120.00
#2   0.00   0.25     1.00     1.00           3.00         120.00
#3   0.25   0.25     1.00     0.25           0.75          81.00
#4   0.25   0.00     0.75     0.25           0.75         193.50
#5   0.00   0.00     0.75     0.50           1.50         270.00
#6   0.25   0.00     0.75     0.25           0.75         168.75

Or we can update with a for loop

for(nm in nm1) {
            df <- df %>%
                    mutate(!! str_c(nm, '_corrected') := 
                          !! rlang::sym(nm) - !! rlang::sym(str_c(nm, '_new')))
  }

Or using base R with lapply

 df[paste0(nm1, "_corrected")] <- lapply(nm1, function(nm)
                   df[nm] - df[paste0(nm, "_new")])
like image 25
akrun Avatar answered Sep 24 '22 23:09

akrun


You can use Map in base R or map2 from purrr for this :

cols <- c('carb', 'disp')
df[paste0(cols, '_corrected')] <- Map(`-`, df[cols], df[paste0(cols, '_new')])

Using map2

library(purrr)
df[paste0(cols, '_corrected')] <- map2(df[cols], df[paste0(cols, '_new')], `-`)
like image 44
Ronak Shah Avatar answered Sep 25 '22 23:09

Ronak Shah


Here is a base R solution. We can define a function, diff_col, that creates the operation and then use a for loop to repeatedly apply this function to modify the data frame.

diff_col <- function(col, dat){
  dat[[paste0(col, "_corrected")]] <- dat[[col]] - dat[[paste0(col, "_new")]]
  return(dat)
}

for (name in c("carb", "disp")){
  df <- diff_col(col = name, dat = df)
} 
like image 24
www Avatar answered Sep 25 '22 23:09

www