
Is there a downside to using get() in dplyr instead of SE?

Tags:

r

dplyr

I have been reading about SE and NSE in dplyr, and have run into a problem where I actually need SE. I have the following function that is supposed to find rows where some items match, but the target variable doesn't:

find_dataset_inconsistencies <- function(df, target_column, cols_to_use) {
  inconsists <- df %>% 
    group_by_at(cols_to_use) %>% 
    summarise(uTargets = length(unique(get(target_column)))) %>% 
    filter(uTargets > 1)
}

This seems to work in my case. However, the get(target_column) is a workaround because I need SE of my variable and cannot hardcode the column name. I initially tried to do it with the SE version (summarise_(.dots = ...)), but had trouble finding the correct syntax for evaluating target_column.

My question is the following: Is there any downside to simply using get()? Are there any cases where this will not work? Any risks or slowdowns? Simply using get is definitely far more readable than the "correct" SE syntax.


asked Jan 09 '18 09:01

Thomas

2 Answers

It can be done with NSE, with rlang.

Assuming your use case is:

find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))
# # A tibble: 8 x 6
# # Groups:   cyl, vs, am, gear [5]
#     cyl    vs    am  gear  carb uTargets
#   <dbl> <dbl> <dbl> <dbl> <dbl>    <int>
# 1  4.00  1.00  0     4.00  2.00        2
# 2  4.00  1.00  1.00  4.00  1.00        4
# 3  4.00  1.00  1.00  4.00  2.00        2
# 4  6.00  1.00  0     3.00  1.00        2
# 5  6.00  1.00  0     4.00  4.00        2
# 6  8.00  0     0     3.00  2.00        4
# 7  8.00  0     0     3.00  3.00        3
# 8  8.00  0     0     3.00  4.00        4

You could:

library(dplyr)

f2 <- function(df, target_column, cols_to_use) {
  group_by_at(df, cols_to_use) %>% 
    summarise(uTargets = n_distinct(!! rlang::sym(target_column))) %>% 
    filter(uTargets > 1)
}

all.equal(
  find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb")),
  f2(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))
)
# [1] TRUE
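As a side note, if the column name has to stay a string, a possible alternative to `!! rlang::sym()` is the `.data` pronoun that dplyr re-exports from rlang. The sketch below is only an illustration: the name `f2_data` is made up here, and it assumes a dplyr version that provides `.data` (0.7.0 or later, as far as I know).

```r
library(dplyr)

# Hypothetical variant of f2(): the column is looked up by its string name
# through the .data pronoun, which only searches the columns of the data.
f2_data <- function(df, target_column, cols_to_use) {
  df %>%
    group_by_at(cols_to_use) %>%
    summarise(uTargets = n_distinct(.data[[target_column]])) %>%
    filter(uTargets > 1)
}

res <- f2_data(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))
nrow(res)
# [1] 8
```

Because `.data[[...]]` never looks outside the data frame, a misspelled column name should raise an error instead of picking up an unrelated object.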

Actual answer to your question about risks:

Now imagine you have foo <- 3 in your global environment. Compare:

find_dataset_inconsistencies(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))
# A tibble: 0 x 6
# Groups:   cyl, vs, am, gear [0]
# ... with 6 variables: cyl <dbl>, vs <dbl>, am <dbl>, gear <dbl>,
#   carb <dbl>, uTargets <int>

which will silently return an empty data frame, and:

f2(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))
# Error in summarise_impl(.data, dots) : variable 'foo' not found

which will raise an error that directly points you to the bug.
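The silent fallback comes from how `get()` searches: with its default `inherits = TRUE`, it walks up the enclosing environments until it finds a match. Here is a minimal base R sketch of that behaviour; the environment `mask` is only a stand-in I made up for the place where dplyr keeps the columns, not dplyr's actual internals.

```r
foo <- 3  # a stray object, as in the example above

# Stand-in for the data: an environment holding one "column", whose parent
# is the current environment (where foo lives).
mask <- list2env(list(mpg = c(21, 22.8)), parent = environment())

get("mpg", envir = mask)  # found in mask itself
get("foo", envir = mask)  # not in mask, so the enclosing environment is searched

# With inherits = FALSE the lookup stops at mask, so the typo surfaces.
res <- tryCatch(
  get("foo", envir = mask, inherits = FALSE),
  error = function(e) conditionMessage(e)
)
res
```

The second call returns 3 without complaint, which is exactly the kind of result that ends up as an empty data frame after the filter.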

Edit

Since you seem to be after the "tidyverse way", I'd recommend the following. The underlying philosophy seems to be to discourage, as much as possible, passing variable names as strings and to pass them as bare names instead:

f3 <- function(df, target_column, ...) {
  target_column <- enquo(target_column)
  cols_to_use <- quos(...)
  group_by(df, !!! cols_to_use) %>% 
    summarise(uTargets = n_distinct(!! target_column)) %>% 
    filter(uTargets > 1)
}
all.equal(
  find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb")),
  f3(mtcars, target_column = mpg, cyl, vs, am, gear, carb)
)
# [1] TRUE

f3()'s interface is also designed to resemble that of other tidyverse functions, and potentially better integrate in a tidyverse pipeline of transformations.
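In more recent versions of the tidyverse the same interface can probably be written more compactly with the embrace operator `{{ }}`. This is a sketch under assumptions: `f4` is a name invented here, and `{{ }}` needs rlang 0.4.0 or later, which is newer than what the code above was written against.

```r
library(dplyr)

# Hypothetical f4(): same interface as f3(), but `...` is forwarded directly
# to group_by() and the target column is embraced with {{ }}.
f4 <- function(df, target_column, ...) {
  df %>%
    group_by(...) %>%
    summarise(uTargets = n_distinct({{ target_column }})) %>%
    filter(uTargets > 1)
}

res <- f4(mtcars, target_column = mpg, cyl, vs, am, gear, carb)
nrow(res)
# [1] 8
```

Depending on the dplyr version, `summarise()` may print an informational message about the grouping of the result; it does not affect the rows returned.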


answered Oct 21 '22 01:10

Aurèle

@Aurele has already shown how to do it using rlang, but I thought it would be interesting to see whether we can get it working with get as well. As pointed out, my first few attempts with get did not work, but after some experimentation the following seem to work as desired. I am not recommending these approaches; they are here purely for interest's sake.

1. get/do

If we wrap the summarise statement in do, then we can use get(..., .) and it will work as desired. This is probably the easiest and most straightforward way to use get within group_by. The key observation is that within do the dot refers only to the rows of the current group, whereas outside of do it refers to all rows of the input when used in an argument to a nested function call.

find_dataset_inconsistencies <- function(df, target_column, cols_to_use) {
  df %>% 
    group_by_at(cols_to_use) %>% 
    do(summarise(., uTargets = length(unique(get(target_column, .))))) %>% 
    filter(uTargets > 1)
}

# gives desired result
find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))
# ... snip correct output ...

# correctly gives an error indicating it can't find `foo`
foo <- 3
find_dataset_inconsistencies(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))
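The reason `get(target_column, .)` does not fall back to `foo` can be seen in base R alone: when the second argument of `get()` is a list or data frame, it is converted to an environment whose parent is the empty environment, so there is nowhere else to look. A small sketch, where `grp` is just an illustrative stand-in for the rows of one group:

```r
foo <- 3

# Stand-in for the rows of one group, as `.` would be inside do().
grp <- mtcars[mtcars$cyl == 4, ]

# Looks the column up among the columns of grp only.
vals <- get("mpg", grp)
length(vals)
# [1] 11

# foo exists outside grp, but it is not a column, so the lookup fails.
res <- tryCatch(get("foo", grp), error = function(e) "not found")
res
# [1] "not found"
```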

2. get reaching into grandparent with inherits=FALSE

find_dataset_inconsistencies <- function(df, target_column, cols_to_use) {
  df %>% 
    group_by_at(cols_to_use) %>% 
    summarise(uTargets = length(unique(get(target_column,
       parent.env(parent.env(environment())), inherits = FALSE)))) %>% 
    filter(uTargets > 1)
}

# gives desired result
find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))
# ... snip correct output ...

# correctly gives an error indicating it can't find `foo`
foo <- 3
find_dataset_inconsistencies(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))
## Error in summarise_impl(.data, dots) : 
##   Evaluation error: object 'foo' not found.

To make this solution a bit more streamlined we could define GET like this:

GET <- function(x) {
  p <- parent.frame()
  p3 <- parent.env(parent.env(p))
  get(x, p3, inherits = FALSE)
}

find_dataset_inconsistencies <- function(df, target_column, cols_to_use) {
  df %>% 
    group_by_at(cols_to_use) %>% 
    summarise(uTargets = length(unique(GET(target_column)))) %>% 
    filter(uTargets > 1)
}

# gives expected answer    
find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))

# gives expected error
foo <- 3
find_dataset_inconsistencies(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))

3. subset by key column

Another possibility would be to subset by a key column. mtcars has no such column, but turning the row names into a column provides one:

library(tibble)  # rownames_to_column() is exported by tibble, not tidyr
find_dataset_inconsistencies <- function(df, target_column, cols_to_use) {
  df %>% 
    rownames_to_column %>%
    group_by_at(cols_to_use) %>% 
    summarise(uTargets = length(unique(
        get(target_column, .[.$rowname %in% rowname, ])))) %>% 
    filter(uTargets > 1)
}

# gives expected answer
find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))

# gives expected error
find_dataset_inconsistencies(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))

answered Oct 21 '22 02:10

G. Grothendieck
