
Why is mutate_at not creating a different name for new column when I pass it only one column in vars()?

Tags: dataframe, r, dplyr

I noticed an unexpected behavior of mutate_at. Suppose I have a data frame and a list of columns I want to mutate, like:

library(dplyr)

# tibble() replaces the deprecated data_frame()
df1 <- tibble(var1 = c(1,2,3,4,5,6),
              var2 = c(1,1,1,2,2,2),
              var3 = c(10,30,50,70,90,110))
variables <- c("var1", "var2")

I now apply mutate_at to create new factor versions of the columns defined in variables. By naming the function "cat" in the list passed to .funs, I make sure the old versions are kept and each new version is named after the old one plus "_cat":

df1 %>% mutate_at(vars(variables), .funs = list(cat = as.factor))
# A tibble: 6 x 5
   var1  var2  var3 var1_cat var2_cat
  <dbl> <dbl> <dbl> <fct>    <fct>   
1     1     1    10 1        1       
2     2     1    30 2        1       
3     3     1    50 3        1       
4     4     2    70 4        2       
5     5     2    90 5        2       
6     6     2   110 6        2  

However, if I apply mutate_at to only one column (in my case, my variables vector has only one element), the name of the new variable is only "cat":

variables <- c("var1")
df1 %>% mutate_at(vars(variables), .funs = list(cat = as.factor))
# A tibble: 6 x 4
   var1  var2  var3 cat  
  <dbl> <dbl> <dbl> <fct>
1     1     1    10 1    
2     2     1    30 2    
3     3     1    50 3    
4     4     2    70 4    
5     5     2    90 5    
6     6     2   110 6  

On some level, I understand why mutate_at is doing this: If you want to name one mutated column in any special way, just use mutate like mutate(var1_cat = as.factor(var1)).

However, in my case, I want to run the mutate_at operation over a number of data frames, for each of which I have a vector of columns to change. Crucially, these vectors might have only one element. So, would it not be better for mutate_at to show the same naming behavior no matter how many vars it receives?

broti asked Dec 14 '22 09:12


2 Answers

I don't think this is the expected behavior (or at least it shouldn't be), and the good news is that the newest version of dplyr gets rid of it. At the time of writing you can install it with remotes::install_github('tidyverse/dplyr'), and it should be on CRAN within the next month or two.

mutate_at (and the other scoped verbs, such as mutate_if and summarize_all) has been superseded by across(), which is used inside the existing verbs and provides the naming behavior you are looking for.

library(dplyr)

variables <- c("var1", "var2")

df1 %>% 
  mutate(across(all_of(variables), .fns = list(cat = as.factor)))
#> # A tibble: 6 x 5
#>    var1  var2  var3 var1_cat var2_cat
#>   <dbl> <dbl> <dbl> <fct>    <fct>   
#> 1     1     1    10 1        1       
#> 2     2     1    30 2        1       
#> 3     3     1    50 3        1       
#> 4     4     2    70 4        2       
#> 5     5     2    90 5        2       
#> 6     6     2   110 6        2

variables <- c("var1")

df1 %>% 
  mutate(across(all_of(variables), .fns = list(cat = as.factor)))
#> # A tibble: 6 x 4
#>    var1  var2  var3 var1_cat
#>   <dbl> <dbl> <dbl> <fct>   
#> 1     1     1    10 1       
#> 2     2     1    30 2       
#> 3     3     1    50 3       
#> 4     4     2    70 4       
#> 5     5     2    90 5       
#> 6     6     2   110 6
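If you would rather not rely on the default naming at all, across() also takes a .names argument that spells the pattern out. The following is a minimal sketch, assuming a recent dplyr release in which the glue placeholder is written "{.col}"; add_cat_cols is a hypothetical helper name introduced here for illustration, not something dplyr provides. It also shows one way to run the same step over several data frames, each with its own column vector, as described in the question.

```r
library(dplyr)

# Same example data as in the question.
df1 <- tibble(var1 = c(1, 2, 3, 4, 5, 6),
              var2 = c(1, 1, 1, 2, 2, 2),
              var3 = c(10, 30, 50, 70, 90, 110))

# Hypothetical helper (not part of dplyr): add a factor copy of each
# column in `cols`, always named "<column>_cat".
add_cat_cols <- function(df, cols) {
  df %>%
    mutate(across(all_of(cols), as.factor, .names = "{.col}_cat"))
}

# One column and several columns follow the same naming rule.
one  <- add_cat_cols(df1, "var1")
many <- add_cat_cols(df1, c("var1", "var2"))

# Apply the helper to several data frames, each paired with its own
# vector of column names (both lists are made up for this example).
dfs    <- list(a = df1, b = df1)
cols   <- list(a = "var1", b = c("var1", "var3"))
result <- Map(add_cat_cols, dfs, cols)
```

Because the output name is built from the column name explicitly, the result should not depend on how many columns are selected.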

Session info

sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 17763)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United Kingdom.1252 
#> [2] LC_CTYPE=English_United Kingdom.1252   
#> [3] LC_MONETARY=English_United Kingdom.1252
#> [4] LC_NUMERIC=C                           
#> [5] LC_TIME=English_United Kingdom.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_0.8.99.9001
#>
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.3        knitr_1.28        magrittr_1.5      tidyselect_1.0.0 
#>  [5] R6_2.4.1          rlang_0.4.5.9000  fansi_0.4.1       stringr_1.4.0    
#>  [9] highr_0.8         tools_3.6.3       xfun_0.12         utf8_1.1.4       
#> [13] cli_2.0.2         htmltools_0.4.0   ellipsis_0.3.0    assertthat_0.2.1 
#> [17] yaml_2.2.1        digest_0.6.25     tibble_2.1.3      lifecycle_0.2.0  
#> [21] crayon_1.3.4      purrr_0.3.3       vctrs_0.2.99.9010 glue_1.3.2       
#> [25] evaluate_0.14     rmarkdown_2.1     stringi_1.4.6     compiler_3.6.3   
#> [29] pillar_1.4.3      pkgconfig_2.0.3
caldwellst answered Mar 23 '23 00:03


I am not sure there is an easy solution to this.

However, one option is to branch on the length of variables and handle the single-column case separately:

library(dplyr)

if (length(variables) > 1) {
   df1 %>% mutate_at(vars(variables), list(cat = as.factor))
} else {
   df1 %>% mutate(!!paste0(variables, "_cat") := as.factor(!!sym(variables)))
}
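As a possible alternative that avoids the branch, the new names can be built up front and assigned in one step with base R. This is only a sketch under the assumption that a plain data.frame is acceptable (the same indexing should also apply to a tibble); new_names is a name introduced here for illustration.

```r
# Same example data as in the question, as a plain data.frame.
df1 <- data.frame(var1 = c(1, 2, 3, 4, 5, 6),
                  var2 = c(1, 1, 1, 2, 2, 2),
                  var3 = c(10, 30, 50, 70, 90, 110))

variables <- c("var1")

# Build the new column names explicitly, then assign one factor
# column per name; this works for one name or for several.
new_names <- paste0(variables, "_cat")
df1[new_names] <- lapply(df1[variables], as.factor)

names(df1)
```

Since the names come from paste0() rather than from mutate_at, a vector with a single element is treated the same way as a longer one.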
Ronak Shah answered Mar 22 '23 22:03