I have been reading about SE and NSE in dplyr, and have run into a problem where I actually need SE. I have the following function that is supposed to find rows where some items match, but the target variable doesn't:
find_dataset_inconsistencies <- function(df, target_column, cols_to_use) {
  inconsists <- df %>%
    group_by_at(cols_to_use) %>%
    summarise(uTargets = length(unique(get(target_column)))) %>%
    filter(uTargets > 1)
}
This seems to work in my case. However, get(target_column) is a workaround: I need standard evaluation of my variable and cannot hardcode the column name. I initially tried the SE version (summarise_(.dots = ...)), but had trouble finding the correct syntax for evaluating target_column.
My question is the following: is there any downside to simply using get()? Are there any cases where this will not work? Any risks or slowdowns? Simply using get is definitely far more readable than the "correct" SE syntax.
It can be done with NSE, with rlang.
Assuming your use case is:
find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))
# # A tibble: 8 x 6
# # Groups: cyl, vs, am, gear [5]
# cyl vs am gear carb uTargets
# <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 4.00 1.00 0 4.00 2.00 2
# 2 4.00 1.00 1.00 4.00 1.00 4
# 3 4.00 1.00 1.00 4.00 2.00 2
# 4 6.00 1.00 0 3.00 1.00 2
# 5 6.00 1.00 0 4.00 4.00 2
# 6 8.00 0 0 3.00 2.00 4
# 7 8.00 0 0 3.00 3.00 3
# 8 8.00 0 0 3.00 4.00 4
You could:
library(dplyr)
f2 <- function(df, target_column, cols_to_use) {
  group_by_at(df, cols_to_use) %>%
    summarise(uTargets = n_distinct(!! rlang::sym(target_column))) %>%
    filter(uTargets > 1)
}
all.equal(
find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb")),
f2(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))
)
# [1] TRUE
Actual answer to your question about risks:
Now imagine you have foo <- 3 in your global environment. Compare:
find_dataset_inconsistencies(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))
# A tibble: 0 x 6
# Groups: cyl, vs, am, gear [0]
# ... with 6 variables: cyl <dbl>, vs <dbl>, am <dbl>, gear <dbl>,
# carb <dbl>, uTargets <int>
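which will silently return an empty data frame (mtcars has no foo column, so get() falls back to the foo in the global environment, every group sees the single value 3, and the filter drops every row), and: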
f2(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))
# Error in summarise_impl(.data, dots) : variable 'foo' not found
which will raise an error that directly points you to the bug.
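The difference comes down to how get() resolves names: by default it searches the given environment and then every enclosing one. A minimal base-R sketch of that behaviour follows. The two environments are hypothetical stand-ins (global_like for the global environment, columns for wherever the data frame's columns live), not dplyr's actual internals:

```r
# Hypothetical stand-ins, not dplyr internals:
# `global_like` plays the global environment,
# `columns` plays the environment holding the data frame's columns.
global_like <- new.env()
assign("foo", 3, envir = global_like)

columns <- new.env(parent = global_like)
assign("mpg", c(21, 22.8, 21.4), envir = columns)

# Default lookup (inherits = TRUE): "foo" is not a column, but get()
# keeps searching the enclosing environments and finds the stray value.
found <- get("foo", envir = columns)

# With inherits = FALSE the search stops at `columns`, so the typo
# surfaces as an error instead of a silent wrong answer.
strict <- tryCatch(
  get("foo", envir = columns, inherits = FALSE),
  error = function(e) conditionMessage(e)
)
```

Here found picks up the value from the enclosing environment, while strict holds the error message. This is the same trade-off as between the get() version and f2 above.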
Edit
Since you seem to be after the "tidyverse way", I'd recommend the following. The underlying philosophy seems to be to discourage, as much as possible, passing variable names as strings, and to pass them as bare names instead:
f3 <- function(df, target_column, ...) {
  target_column <- enquo(target_column)
  cols_to_use <- quos(...)
  group_by(df, !!! cols_to_use) %>%
    summarise(uTargets = n_distinct(!! target_column)) %>%
    filter(uTargets > 1)
}
all.equal(
find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb")),
f3(mtcars, target_column = mpg, cyl, vs, am, gear, carb)
)
# [1] TRUE
f3()'s interface is also designed to resemble that of other tidyverse functions, so it may integrate better into a tidyverse pipeline of transformations.
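On a recent dplyr, a similar result can probably be had without calling the rlang helpers directly. The sketch below assumes dplyr 1.0.0 or later. f4 and f5 are hypothetical names and are not part of the original answer. It uses the .data pronoun for a column name held in a string, across(all_of(...)) for the grouping columns, and the {{ }} embrace operator for a bare column name:

```r
library(dplyr)

# String interface: column names arrive as character vectors.
f4 <- function(df, target_column, cols_to_use) {
  df %>%
    group_by(across(all_of(cols_to_use))) %>%
    # .data[[...]] looks only in the data, never in calling environments
    summarise(uTargets = n_distinct(.data[[target_column]]), .groups = "drop") %>%
    filter(uTargets > 1)
}

# Bare-name interface: {{ }} stands in for the enquo() / !! pair.
f5 <- function(df, target_column, ...) {
  df %>%
    group_by(...) %>%
    summarise(uTargets = n_distinct({{ target_column }}), .groups = "drop") %>%
    filter(uTargets > 1)
}

res_string <- f4(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))
res_bare   <- f5(mtcars, mpg, cyl, vs, am, gear, carb)
```

Note that .groups = "drop" returns an ungrouped tibble, whereas the versions above keep the result grouped by all but the last grouping column. Because .data[[target_column]] only looks inside the data, a misspelled column name should raise an error even if an object of that name exists in the global environment.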
@Aurele has already shown how to do it using rlang, but I thought it would be interesting to see if we can get it working using get as well. As pointed out, my first few attempts at get did not work, but after some experimentation this seems to work as desired. I am not recommending this approach; it is here just for interest's sake.
If we wrap the summarise statement in do, then we can use get(..., .) like this and it will work as desired. This is probably the easiest and most straightforward way to use get within group_by. The key observation is that within do the dot refers only to the rows in the current group, whereas outside of do it refers to all rows of the input when used in the actual argument to a nested function call.
find_dataset_inconsistencies <- function(df, target_column, cols_to_use) {
  df %>%
    group_by_at(cols_to_use) %>%
    do(summarise(., uTargets = length(unique(get(target_column, .))))) %>%
    filter(uTargets > 1)
}
# gives desired result
find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))
# ... snip correct output ...
# correctly gives an error indicating it can't find `foo`
foo <- 3
find_dataset_inconsistencies(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))
A second approach passes an explicit environment to get and sets inherits = FALSE, so the lookup does not fall through to the global environment:
find_dataset_inconsistencies <- function(df, target_column, cols_to_use) {
  df %>%
    group_by_at(cols_to_use) %>%
    summarise(uTargets = length(unique(get(target_column,
      parent.env(parent.env(environment())), inherits = FALSE)))) %>%
    filter(uTargets > 1)
}
# gives desired result
find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))
# ... snip correct output ...
# correctly gives an error indicating it can't find `foo`
foo <- 3
find_dataset_inconsistencies(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))
## Error in summarise_impl(.data, dots) :
## Evaluation error: object 'foo' not found.
To make this solution a bit more streamlined we could define GET like this:
GET <- function(x) {
  p <- parent.frame()
  p3 <- parent.env(parent.env(p))
  get(x, p3, inherits = FALSE)
}
find_dataset_inconsistencies <- function(df, target_column, cols_to_use) {
  df %>%
    group_by_at(cols_to_use) %>%
    summarise(uTargets = length(unique(GET(target_column)))) %>%
    filter(uTargets > 1)
}
# gives expected answer
find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))
# gives expected error
foo <- 3
find_dataset_inconsistencies(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))
Another possibility would be to subset by a key column. mtcars has no such column, but if we make the row names into such a column then we would have one:
library(tibble)  # rownames_to_column() is exported by tibble, not tidyr
find_dataset_inconsistencies <- function(df, target_column, cols_to_use) {
  df %>%
    rownames_to_column %>%
    group_by_at(cols_to_use) %>%
    summarise(uTargets = length(unique(
      get(target_column, .[.$rowname %in% rowname, ])))) %>%
    filter(uTargets > 1)
}
# gives expected answer
find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))
# gives expected error
find_dataset_inconsistencies(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))