
dplyr 0.7.5 change in select() behavior

Tags: r, dplyr

select() in dplyr 0.7.5 returns a different result from dplyr 0.7.4 when using a named vector to specify columns.

library(dplyr)                               
df <- data.frame(a = 1:5, b = 6:10, c = 11:15)
print(df)                                     
#>   a  b  c
#> 1 1  6 11
#> 2 2  7 12
#> 3 3  8 13
#> 4 4  9 14
#> 5 5 10 15

# a named vector
cols <- c(x = 'a', y = 'b', z = 'c')          
print(cols)                                   
#>   x   y   z 
#> "a" "b" "c" 

# with dplyr 0.7.4
# the result's columns are named after the vector's values (a, b, c)
select(df, cols)                              
#>   a  b  c
#> 1 1  6 11
#> 2 2  7 12
#> 3 3  8 13
#> 4 4  9 14
#> 5 5 10 15

# with dplyr 0.7.5
# the result's columns are named after the vector's names (x, y, z)
select(df, cols)                              
#>   x  y  z
#> 1 1  6 11
#> 2 2  7 12
#> 3 3  8 13
#> 4 4  9 14
#> 5 5 10 15

Is this a bug or a feature?

Evan Antworth asked Jun 07 '18 18:06


1 Answer

IMO the 0.7.4 behaviour could have been considered a bug; 0.7.5 fixes it and is more user-friendly.

With the move to tidyselect, the selection logic has become a little more sophisticated. Compare dplyr::select_vars with the new tidyselect::vars_select (the functions used by dplyr:::select.data.frame in 0.7.4 and 0.7.5 respectively): in 0.7.4, the line below dropped the names when the vector was both named and quoted (i.e. made of strings):

ind_list <- map_if(ind_list, is_character, match_var, table = vars)

# example:
dplyr:::select.data.frame(mtcars, c(a = "mpg", b = "disp"))

Note that this is not an issue with named vectors in general, as the typical unquoted case was always fine:

dplyr:::select.data.frame(mtcars, c(a = mpg, b = disp))
# (here the names are indeed "a" and "b" afterwards)

In the 0.7.4 code, this line handles the use of c():

ind_list <- map_if(ind_list, !is_helper, eval_tidy, data = names_list)

eval_tidy is from the rlang package; for the problematic call, the line above returns the following:

[[1]]
     a      b 
 "mpg" "disp" 

tidyselect adds some extra handling; see https://github.com/tidyverse/tidyselect/blob/master/R/vars-select.R.

In particular, vars_select_eval contains the following line, which handles the use of c():

ind_list <- map_if(quos, !is_helper, overscope_eval_next, overscope = overscope)

overscope_eval_next is again from the rlang package and calls the same routine as eval_tidy would, but through its overscope argument it receives a variant of c() that handles strings (see tidyselect:::vars_c). After this line, the c(a = "mpg", b = "disp") case therefore looks the same as c(a = mpg, b = disp):

[[1]]
a b   # these are the names
1 3   # these are the positions of the selected cols

Because the result is now a named vector of positions rather than strings, is_character no longer holds in the subsequent code (unlike above with rlang::eval_tidy), so the names are not dropped.
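To make the difference concrete, here is a base-R sketch of resolving named strings to column positions with and without keeping the names. It is only an illustration of the idea, not tidyselect's actual code; the variable names (vars, sel, pos_unnamed, pos_named) are made up for this example.

```r
# Column names of the data we select from
vars <- names(mtcars)

# A named, quoted selection, as in the problematic call
sel <- c(a = "mpg", b = "disp")

# match() returns plain positions and discards the names of `sel`
pos_unnamed <- match(sel, vars)
print(names(pos_unnamed))
#> NULL

# Re-attaching the names gives named positions, which is the shape
# of the result shown above (names a and b, positions 1 and 3)
pos_named <- setNames(match(sel, vars), names(sel))
print(pos_named)
#> a b 
#> 1 3 
```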

If you look at these functions in rlang, you may be confused to find that overscope_eval_next is soft-deprecated in favor of eval_tidy. My guess is that tidyselect simply hasn't been cleaned up in this respect yet (naming inconsistencies etc. would have to be addressed as well, so it would mean rewriting more than just the one line with the call). In the end, eval_tidy can now be used in the same way, and probably will be.
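If your code depends on one behaviour or the other, it may be safer to spell out which one you want instead of relying on how a bare named vector is treated. A minimal sketch, assuming a recent dplyr (1.0.0 or later, where all_of() is available; it did not exist in the 0.7.x releases discussed here):

```r
library(dplyr)

df <- data.frame(a = 1:5, b = 6:10, c = 11:15)
cols <- c(x = "a", y = "b", z = "c")

# Select by value only: stripping the names keeps the original column names
by_value <- select(df, all_of(unname(cols)))
names(by_value)
#> [1] "a" "b" "c"

# Select and rename: a named vector renames the selected columns
by_name <- select(df, all_of(cols))
names(by_name)
#> [1] "x" "y" "z"
```

Wrapping the vector in unname() makes the intent explicit, so the result does not change if the vector later gains names.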

RolandASc answered Sep 28 '22 05:09