To streamline data wrangling, I write a wrapper function consisted of several "verb functions" that process the data. Each one performs one task on the data. However, not all tasks are applicable to all datasets that pass through this process, and sometimes, for certain data, I might want to switch off some "verb functions", and skip them.
I'm trying to understand whether there's a conventional/canonical way to build such workflow within a wrapper function in R. Importantly, a way that will be efficient, both performance-wise and concise code.
As part of data wrangling, I want to carry out several steps:
janitor::clean_names()
)TRUE
and FALSE
are replaced with 1
and 0
(using gsub()
).tolower()
).id
column (using tidyr::pivot_wider
)NA
values (using dplyr::drop_na()
)Toy data
library(stringi)
library(tidyr)
set.seed(2021)
# simulate data
df <-
data.frame(id = 1:20,
isMale = rep(c("true", "false"), times = 10),
WEIGHT = sample(50:100, 20),
hash_Numb = stri_rand_strings(20, 5)) %>%
cbind(., score = sample(200:800, size = 20))
# sprinkle NAs randomly
df[c("isMale", "WEIGHT", "hash_Numb", "score")] <-
lapply(df[c("isMale", "WEIGHT", "hash_Numb", "score")], function(x) {
x[sample(seq_along(x), 0.25 * length(x))] <- NA
x
})
df <-
df %>%
tidyr::expand_grid(., Condition = c("A","B"))
df
#> # A tibble: 40 x 6
#> id isMale WEIGHT hash_Numb score Condition
#> <int> <chr> <int> <chr> <int> <chr>
#> 1 1 <NA> 56 EvRAq NA A
#> 2 1 <NA> 56 EvRAq NA B
#> 3 2 false 87 <NA> 322 A
#> 4 2 false 87 <NA> 322 B
#> 5 3 true 95 13pXe 492 A
#> 6 3 true 95 13pXe 492 B
#> 7 4 <NA> 88 4WMBS 626 A
#> 8 4 <NA> 88 4WMBS 626 B
#> 9 5 true NA Nrl1W 396 A
#> 10 5 true NA Nrl1W 396 B
#> # ... with 30 more rows
Created on 2021-03-03 by the reprex package (v0.3.0)
The data shows test scores of 20 people who took a test under two conditions. For each person we also know the gender (isMale
), the weight in kilograms(WEIGHT
), and a unique hash_number
.
Data cleanup and wrangling
Before this data is sent to analysis, it needs to be cleaned up, according to a certain chain of steps, which I laid out above.
library(janitor)
library(dplyr)
# helper function
convert_true_false_to_1_0 <- function(x) {
first_pass <- gsub("^(?:TRUE)$", 1, x, ignore.case = TRUE)
gsub("^(?:FALSE)$", 0, first_pass, ignore.case = TRUE)
}
# chain of steps
df %>%
janitor::clean_names() %>%
mutate(across(everything(), convert_true_false_to_1_0)) %>%
mutate(across(everything(), tolower)) %>%
pivot_wider(names_from = condition, values_from = score) %>%
drop_na()
My Question: How to pack this process in a wrapper that allows to flexibly switch some steps off?
One idea I have in my mind is to use a %>%
pipe with conditionals such as:
my_wrangling_wrapper <- function(dat,
clean_names = TRUE,
convert_tf_to_1_0 = TRUE,
convert_to_lower = TRUE,
pivot_widr = TRUE,
drp_na = TRUE){
dat %>%
{if (clean_names) janitor::clean_names(.) else .} %>%
{if (convert_tf_to_1_0) mutate(., across(everything(), convert_true_false_to_1_0)) else .} %>%
{if (convert_to_lower) mutate(., across(everything(), tolower)) else .} %>%
{if (pivot_widr) pivot_wider(., names_from = condition, values_from = score) else .} %>%
{if (drp_na) drop_na(.) else .}
}
This way, all steps are defaulted to happen, unless turned off:
> my_wrangling_wrapper(dat = df)
## # A tibble: 6 x 6
## id is_male weight hash_numb a b
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 3 1 95 13pxe 492 492
## 2 9 1 54 hgzxp 519 519
## 3 12 0 72 vwetc 446 446
## 4 15 1 52 qadxc 501 501
## 5 17 1 71 g42vg 756 756
## 6 18 0 80 qiejd 712 712
true
/false
to 1
/0
and don't drop NA
s:> my_wrangling_wrapper(dat = df, convert_tf_to_1_0 = FALSE, drp_na = FALSE)
## # A tibble: 20 x 6
## id is_male weight hash_numb a b
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 NA 56 evraq NA NA
## 2 2 false 87 NA 322 322
## 3 3 true 95 13pxe 492 492
## 4 4 NA 88 4wmbs 626 626
## 5 5 true NA nrl1w 396 396
## 6 6 false NA 4oq74 386 386
## 7 7 true NA gg23f NA NA
## 8 8 false 94 NA NA NA
## 9 9 true 54 hgzxp 519 519
## 10 10 false 97 NA 371 371
## 11 11 true 90 NA 768 768
## 12 12 false 72 vwetc 446 446
## 13 13 NA NA jkhjh 338 338
## 14 14 false NA 0swem 778 778
## 15 15 true 52 qadxc 501 501
## 16 16 false 75 NA 219 219
## 17 17 true 71 g42vg 756 756
## 18 18 false 80 qiejd 712 712
## 19 19 NA 68 tadad NA NA
## 20 20 NA 53 iyw3o NA NA
Although the solution I came up with does work, I've learned that relying on the pipe operator is not advised within functions, because it slows down the process (see reference). Also, since %>%
is not part of base R
, there has to be a way to achieve the same "tweakable wrapping" functionality without the pipe. So I wonder: is there a conventional way to write a wrapper function that could be tweaked to turn off some of its components, and still overall remain performance-efficient?
{It's worth mentioning that I've asked a similar question regarding building a wrapper for ggplot
, turning geoms
off as desired. The answer was great but not applicable to the current question.}
Staying with %>%
, you could create a functional sequence:
library(magrittr)
my_wrangling_wrapper =
. %>%
janitor::clean_names() %>%
mutate(across(everything(), convert_true_false_to_1_0)) %>%
mutate(across(everything(), tolower)) %>%
pivot_wider(names_from = condition, values_from = score) %>%
drop_na()
As this sequence behaves like a list, you can decide which steps to use by selecting the elements:
clean_names = TRUE
convert_tf_to_1_0 = TRUE
convert_to_lower = FALSE
pivot_widr = FALSE
drp_na = TRUE
my_wrangling_wrapper[c(clean_names,
convert_tf_to_1_0,
convert_to_lower,
pivot_widr,
drp_na)]
#Functional sequence with the following components:
#
# 1. janitor::clean_names(.)
# 2. mutate(., across(everything(), convert_true_false_to_1_0))
# 3. drop_na(.)
df %>% my_wrangling_wrapper[c(clean_names,
convert_tf_to_1_0,
convert_to_lower,
pivot_widr,
drp_na)]()
# id is_male weight hash_numb score
#1 1 1 51 Zm1Xx 343
#2 3 1 99 Xc2rm 703
#3 6 0 62 2r2cP 243
#4 12 0 84 llI0f 297
#5 16 0 72 AO76M 475
#6 18 0 63 zGJmW 376
Without %>%
, you could use the equivalent freduce
solution:
clean_names <- function(x) janitor::clean_names(x,dat)
convert_tf_to_1_0 <- function(x) mutate(x,dat, across(everything(),
convert_true_false_to_1_0))
convert_to_lower <- function(x) mutate(x,dat, across(everything(), tolower))
pivot_widr <- function(x) pivot_wider(x,dat, names_from = condition,
values_from = score)
drp_na <- function(x) drop_na(x, dat)
my_wrangling_list <- list(clean_names, convert_tf_to_1_0, drp_na)
magrittr::freduce(df, my_wrangling_list)
Or with %>%
and freduce
:
df %>% freduce(my_wrangling_list)
I wouldn't be too concerned by the piping overhead, see this answer in the link you referenced : when comparing milliseconds, piping has an impact, but when it comes to bigger calculations, piping overhead becomes negligible.
You can use closures to achieve the same as the functional sequence illustrated in the @Waldi answer. Something like:
#we build a wrapper generator providing an arbitrary number of functions to apply
wrapperGenerator<-function(...) {
flist<-list(...)
function(data, conf = rep(TRUE, length(flist))) {
if (!is.logical(conf) || (length(conf)!=length(flist)))
stop("Wrong conf")
for (i in seq_along(flist)) {
if (conf[[i]])
data<-flist[[i]](data)
}
data
}
}
#An example for string manipulation
wg<-wrapperGenerator(tolower, function(x) paste0(x,"_suff"), function(x) substring(x,1,5))
#some usage
require(stringi)
set.seed(1)
data<-stri_rand_strings(10,10)
data
#[1] "GNZuCtwed3" "CAgNlUizNm" "vDe7GN0NrL" "TbUBpfn6iP" "oemYWm1Tjg"
#[6] "TrRF46JWfP" "uISKeFTl5s" "LqLKTtrOmx" "QiOKkCi7F8" "E3dsmnSPob"
#Full pipeline
wg(data)
#[1] "gnzuc" "cagnl" "vde7g" "tbubp" "oemyw" "trrf4" "uiske" "lqlkt" "qiokk"
#[10] "e3dsm"
#Just the first two steps
wg(data,c(TRUE,TRUE,FALSE))
# [1] "gnzuctwed3_suff" "cagnluiznm_suff" "vde7gn0nrl_suff" "tbubpfn6ip_suff"
# [5] "oemywm1tjg_suff" "trrf46jwfp_suff" "uiskeftl5s_suff" "lqlkttromx_suff"
# [9] "qiokkci7f8_suff" "e3dsmnspob_suff"
EDIT
Adding a few comments about how the above works. The wrapperGenerator
is a function that returns a function and it's built just by providing the functions you want to wrap. No data is needed here. The value of wrapperGenerator
is itself a function (wg
in the example) that you can apply to the actual data. By providing the additional conf
argument to this function, you tell which steps are to be performed.
Closures are a very powerful tool in R. Here you find a must read on the subject.
I would (as @Nicola) use closures but with (IMHO) slightly cleaner interface:
function_factory <- function(...) {
all_fns <- list(...)
## ... arguments must be named
stopifnot(!is.null(names(all_fns)))
function(x, ...) {
selected_fns <- rev(as.character(rlang::ensyms(...)))
## if nothing was selected chose everything
if (!length(selected_fns)) {
selected_fns <- rev(names(all_fns))
}
stopifnot(all(selected_fns %in% names(all_fns)))
## function compose operator
`%.%` <- function(f1, f2) function(...) f1(f2(...))
fn_seq <- Reduce(`%.%`, all_fns[selected_fns])
fn_seq(x)
}
}
## define all potential functions via named(!) arguments to funciton_factory
fn_f <- function_factory(multiply_by_1000 = function(x) x * 1000,
make_negative = function(x) -abs(x),
add_100 = function(x) x + 100)
## example with vector as input
x <- 1:10
## to apply a selected subset simply provide the names of the chunks
fn_f(x, add_100)
fn_f(x, multiply_by_1000, make_negative)
## order matters
fn_f(x, add_100, make_negative)
fn_f(x, make_negative, add_100)
## example with data.frame as input
library(dplyr)
m2 <- mtcars
m2 <- m2 %>%
mutate(across(everything(), .fns = function(x) {
x[sample(length(x), 5)] <- NA
x
}))
fn_fd <- function_factory(replace_nas = function(data) mutate(data, across(everything(), .fns = coalesce, -1)),
round = function(data) mutate(data, across(where(is.double), .fns = round, 0)),
append_new = function(data) mutate(data, across(c(vs, am), .fns = paste0, "_new")))
fn_fd(m2, replace_nas, round)
fn_fd(m2, replace_nas, append_new, round)
fn_fd(m2, replace_nas, round, append_new)
## toy example from OP
toy_f <- function_factory(clean_names = clean_names,
convert_0_1 = function(x) mutate(x, across(everything(), convert_true_false_to_1_0)),
to_lower = function(x) mutate(x, across(everything(), tolower)),
pivot = function(x) pivot_wider(x, names_from = condition, values_from = score),
dropna = drop_na)
## do all
toy_f(df)
## everything but conversion and dropping
toy_f(df, clean_names, to_lower, pivot)
function_factory
is a closure
, i.e. it stores all named(!) arguments in all_fns
and returns a function
. The idea is that this function now can access all_fns
and do its magic....
to see which elements we want to use (the rlang::ensyms
part is syntacic sugar, because with this approach we can specify the names of the functions without quotes)Reduce
, which successively combines the elements of a given vector [1] (this could also be done via purrr::compose
)[1] ?Reduce
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With