
Is there a downside to using get() in dplyr instead of SE?

Tags: r, dplyr

I have been reading about SE and NSE in dplyr, and have run into a problem where I actually need SE. I have the following function that is supposed to find rows where some items match, but the target variable doesn't:

find_dataset_inconsistencies <- function(df, target_column, cols_to_use) {
  inconsists <- df %>% 
    group_by_at(cols_to_use) %>% 
    summarise(uTargets = length(unique(get(target_column)))) %>% 
    filter(uTargets > 1)
}

This seems to work in my case. However, the get(target_column) is a workaround because I need SE of my variable and cannot hardcode the column name. I initially tried to do it with the SE version (summarise_(.dots = ...)), but had trouble finding the correct syntax for evaluating target_column.

My question is the following: Is there any downside to simply using get()? Are there any cases where this will not work? Any risks or slowdowns? Simply using get() is definitely far more readable than the "correct" SE syntax.

Thomas asked Jan 09 '18 09:01



2 Answers

It can be done with NSE, with rlang.

Assuming your use case is:

find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))
# # A tibble: 8 x 6
# # Groups:   cyl, vs, am, gear [5]
#     cyl    vs    am  gear  carb uTargets
#   <dbl> <dbl> <dbl> <dbl> <dbl>    <int>
# 1  4.00  1.00  0     4.00  2.00        2
# 2  4.00  1.00  1.00  4.00  1.00        4
# 3  4.00  1.00  1.00  4.00  2.00        2
# 4  6.00  1.00  0     3.00  1.00        2
# 5  6.00  1.00  0     4.00  4.00        2
# 6  8.00  0     0     3.00  2.00        4
# 7  8.00  0     0     3.00  3.00        3
# 8  8.00  0     0     3.00  4.00        4

You could:

library(dplyr)

f2 <- function(df, target_column, cols_to_use) {
  group_by_at(df, cols_to_use) %>% 
    summarise(uTargets = n_distinct(!! rlang::sym(target_column))) %>% 
    filter(uTargets > 1)
}

all.equal(
  find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb")),
  f2(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))
)
# [1] TRUE

Actual answer to your question about risks:

Now imagine you have foo <- 3 in your global environment. Compare:

find_dataset_inconsistencies(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))
# A tibble: 0 x 6
# Groups:   cyl, vs, am, gear [0]
# ... with 6 variables: cyl <dbl>, vs <dbl>, am <dbl>, gear <dbl>,
#   carb <dbl>, uTargets <int>

which will silently return an empty data frame, and:

f2(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))
# Error in summarise_impl(.data, dots) : variable 'foo' not found

which will raise an error that directly points you to the bug.
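The silent fall-through comes from how get() searches: by default (inherits = TRUE) it looks in the given environment and then in each enclosing environment in turn. The following is a minimal base R sketch of that behaviour. The environments outer_env and data_env are hypothetical stand-ins for the global environment and the data mask; they are not what dplyr builds internally.

```r
# Hypothetical stand-in for the global environment, holding a stray `foo`
outer_env <- new.env()
assign("foo", 3, envir = outer_env)

# Hypothetical stand-in for the data: it has `mpg` but no `foo`
data_env <- new.env(parent = outer_env)
assign("mpg", c(21, 22.8, 21.4), envir = data_env)

# Default lookup: `foo` is missing from data_env, so get() keeps searching
# the enclosing environment and quietly returns the outer value
found_default <- get("foo", envir = data_env)
found_default
# [1] 3

# Strict lookup: with inherits = FALSE the search stops at data_env and errors
found_strict <- tryCatch(
  get("foo", envir = data_env, inherits = FALSE),
  error = function(e) conditionMessage(e)
)
```

Here found_strict ends up holding the error message rather than a value, which is the behaviour you would want for a mistyped column name.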


Edit

Since you seem to be after the "tidyverse way", I'd recommend the following. The underlying philosophy seems to be to discourage, as much as possible, passing variable names as strings, and to pass them as bare names instead:

f3 <- function(df, target_column, ...) {
  target_column <- enquo(target_column)
  cols_to_use <- quos(...)
  group_by(df, !!! cols_to_use) %>% 
    summarise(uTargets = n_distinct(!! target_column)) %>% 
    filter(uTargets > 1)
}
all.equal(
  find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb")),
  f3(mtcars, target_column = mpg, cyl, vs, am, gear, carb)
)
# [1] TRUE

f3()'s interface is also designed to resemble that of other tidyverse functions, and potentially better integrate in a tidyverse pipeline of transformations.
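If the column name really has to arrive as a string, another option is the .data pronoun, which looks names up in the data only. This is a sketch under assumptions: it presumes a dplyr version that supports .data inside summarise(), and f4 is a hypothetical name that mirrors f2 above.

```r
library(dplyr)

# f4 is a hypothetical variant of f2 that uses the .data pronoun
f4 <- function(df, target_column, cols_to_use) {
  df %>%
    group_by_at(cols_to_use) %>%
    # .data[[...]] only looks at columns of the data, never at outside variables
    summarise(uTargets = n_distinct(.data[[target_column]])) %>%
    filter(uTargets > 1)
}

foo <- 3  # a stray variable that plain get() would pick up

res_mpg <- f4(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))

# "foo" is not a column, so this is expected to raise an error
res_foo <- tryCatch(
  f4(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb")),
  error = function(e) "error"
)
```

As with f2, a missing column should surface as an error here instead of an empty result.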

Aurèle answered Oct 21 '22 01:10


@Aurele has already shown how to do it using rlang, but I thought it would be interesting to see whether it can be made to work using get as well. As was pointed out, my first few attempts with get did not work, but after some experimentation the following approaches seem to work as desired. I am not recommending them; they are here for interest's sake.

1. get/do

If we wrap the summarise statement in do, we can use get(..., .) and it will work as desired. This is probably the easiest and most straightforward way to use get within a grouped pipeline. The key observation is that within do the dot refers only to the rows of the current group, whereas outside of do, when used in an argument to a nested function call, it refers to all rows of the input.

find_dataset_inconsistencies <- function(df, target_column, cols_to_use) {
  df %>% 
    group_by_at(cols_to_use) %>% 
    do(summarise(., uTargets = length(unique(get(target_column, .))))) %>% 
    filter(uTargets > 1)
}

# gives desired result
find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))
# ... snip correct output ...

# correctly gives an error indicating it can't find `foo`
foo <- 3
find_dataset_inconsistencies(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))
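
Part of why this version errors on "foo" can be seen in base R alone. When get() is given a data frame as its second argument, it converts it with as.environment(), and for a list that conversion produces an environment whose parent is the empty environment. The lookup therefore has nowhere else to search. A minimal sketch follows; the variable names mpg_from_df and foo_from_df are hypothetical.

```r
foo <- 3  # a stray variable outside the data

# A real column is found directly in the data frame
mpg_from_df <- get("mpg", mtcars)

# "foo" is not a column, and the converted data frame has no enclosing
# environment to fall back on, so this lookup fails rather than
# returning the outside value
foo_from_df <- tryCatch(
  get("foo", mtcars),
  error = function(e) "not found"
)
```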

2. get reaching into grandparent with inherits=FALSE

find_dataset_inconsistencies <- function(df, target_column, cols_to_use) {
  df %>% 
    group_by_at(cols_to_use) %>% 
    summarise(uTargets = length(unique(get(target_column,
       parent.env(parent.env(environment())), inherits = FALSE)))) %>% 
    filter(uTargets > 1)
}

# gives desired result
find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))
# ... snip correct output ...

# correctly gives an error indicating it can't find `foo`
foo <- 3
find_dataset_inconsistencies(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))
## Error in summarise_impl(.data, dots) : 
##   Evaluation error: object 'foo' not found.

To make this solution a bit more streamlined we could define GET like this:

GET <- function(x) {
  p <- parent.frame()
  p3 <- parent.env(parent.env(p))
  get(x, p3, inherits = FALSE)
}

find_dataset_inconsistencies <- function(df, target_column, cols_to_use) {
  df %>% 
    group_by_at(cols_to_use) %>% 
    summarise(uTargets = length(unique(GET(target_column)))) %>% 
    filter(uTargets > 1)
}

# gives expected answer    
find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))

# gives expected error
foo <- 3
find_dataset_inconsistencies(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))

3. subset by key column

Another possibility would be to subset by a key column. mtcars has no such column but if we make the row names into such a column then we would have one:

library(tibble)  # rownames_to_column() is provided by tibble, not tidyr
find_dataset_inconsistencies <- function(df, target_column, cols_to_use) {
  df %>% 
    rownames_to_column %>%
    group_by_at(cols_to_use) %>% 
    summarise(uTargets = length(unique(
        get(target_column, .[.$rowname %in% rowname, ])))) %>% 
    filter(uTargets > 1)
}

# gives expected answer
find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))

# gives expected error
find_dataset_inconsistencies(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))
G. Grothendieck answered Oct 21 '22 02:10