How to write an efficient wrapper for data wrangling, allowing to turn off any wrapped part when calling the wrapper

To streamline data wrangling, I wrote a wrapper function consisting of several "verb functions" that process the data. Each one performs one task on the data. However, not all tasks are applicable to every dataset that passes through this process, and for certain data I might want to switch some "verb functions" off and skip them.

I'm trying to understand whether there's a conventional/canonical way to build such a workflow within a wrapper function in R. Importantly, I'm looking for a way that is efficient both performance-wise and in terms of concise code.

Example

As part of data wrangling, I want to carry out several steps:

  1. Clean up column headers (using janitor::clean_names())
  2. Recode values in the data, such that TRUE and FALSE are replaced with 1 and 0 (using gsub()).
  3. Recode string values to lowercase (using tolower()).
  4. Pivot wider based on specific columns (using tidyr::pivot_wider())
  5. Drop rows with NA values (using tidyr::drop_na())

Toy data

library(stringi)
library(tidyr)

set.seed(2021)

# simulate data
df <-
  data.frame(id = 1:20,
             isMale = rep(c("true", "false"), times = 10),
             WEIGHT = sample(50:100, 20),
             hash_Numb = stri_rand_strings(20, 5)) %>%
  cbind(., score = sample(200:800, size = 20))

# sprinkle NAs randomly
df[c("isMale", "WEIGHT", "hash_Numb", "score")] <-
  lapply(df[c("isMale", "WEIGHT", "hash_Numb", "score")], function(x) {
    x[sample(seq_along(x), 0.25 * length(x))] <- NA
    x
  })


df <- 
  df %>%
  tidyr::expand_grid(., Condition = c("A","B"))

df
#> # A tibble: 40 x 6
#>       id isMale WEIGHT hash_Numb score Condition
#>    <int> <chr>   <int> <chr>     <int> <chr>    
#>  1     1 <NA>       56 EvRAq        NA A        
#>  2     1 <NA>       56 EvRAq        NA B        
#>  3     2 false      87 <NA>        322 A        
#>  4     2 false      87 <NA>        322 B        
#>  5     3 true       95 13pXe       492 A        
#>  6     3 true       95 13pXe       492 B        
#>  7     4 <NA>       88 4WMBS       626 A        
#>  8     4 <NA>       88 4WMBS       626 B        
#>  9     5 true       NA Nrl1W       396 A        
#> 10     5 true       NA Nrl1W       396 B        
#> # ... with 30 more rows

Created on 2021-03-03 by the reprex package (v0.3.0)
The data shows the test scores of 20 people who took a test under two conditions. For each person we also know the gender (isMale), the weight in kilograms (WEIGHT), and a unique hash (hash_Numb).

Data cleanup and wrangling
Before this data is sent to analysis, it needs to be cleaned up according to the chain of steps I laid out above.

library(janitor)
library(dplyr)

# helper function: recode "TRUE"/"FALSE" (any letter case) to 1/0,
# leaving all other values untouched
convert_true_false_to_1_0 <- function(x) {
  first_pass <- gsub("^(?:TRUE)$", 1, x, ignore.case = TRUE)
  gsub("^(?:FALSE)$", 0, first_pass, ignore.case = TRUE)
}
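As a quick sanity check of the helper (values that are neither true nor false pass through unchanged):

convert_true_false_to_1_0(c("true", "FALSE", "maybe"))
#> [1] "1"     "0"     "maybe"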

# chain of steps
df %>%
  janitor::clean_names() %>%
  mutate(across(everything(), convert_true_false_to_1_0)) %>%
  mutate(across(everything(), tolower)) %>%
  pivot_wider(names_from = condition, values_from = score) %>%
  drop_na()

My Question: How can I pack this process into a wrapper that lets me flexibly switch some steps off?
One idea I have in mind is to use a %>% pipe with conditionals, such as:

my_wrangling_wrapper <- function(dat,
                                 clean_names       = TRUE, 
                                 convert_tf_to_1_0 = TRUE, 
                                 convert_to_lower  = TRUE, 
                                 pivot_widr        = TRUE,
                                 drp_na            = TRUE){
  dat %>%
    {if (clean_names)       janitor::clean_names(.)                                     else .} %>%
    {if (convert_tf_to_1_0) mutate(., across(everything(), convert_true_false_to_1_0))  else .} %>%
    {if (convert_to_lower)  mutate(., across(everything(), tolower))                    else .} %>%
    {if (pivot_widr)        pivot_wider(., names_from = condition, values_from = score) else .} %>%
    {if (drp_na)            drop_na(.)                                                  else .}
}

This way, all steps run by default, unless turned off:

  • Use-case #1 -- Default run:
> my_wrangling_wrapper(dat = df)

## # A tibble: 6 x 6
##   id    is_male weight hash_numb a     b    
##   <chr> <chr>   <chr>  <chr>     <chr> <chr>
## 1 3     1       95     13pxe     492   492  
## 2 9     1       54     hgzxp     519   519  
## 3 12    0       72     vwetc     446   446  
## 4 15    1       52     qadxc     501   501  
## 5 17    1       71     g42vg     756   756  
## 6 18    0       80     qiejd     712   712 
  • Use-case #2 -- Don't convert true/false to 1/0 and don't drop NAs:
> my_wrangling_wrapper(dat = df, convert_tf_to_1_0 = FALSE, drp_na = FALSE)

## # A tibble: 20 x 6
##    id    is_male weight hash_numb a     b    
##    <chr> <chr>   <chr>  <chr>     <chr> <chr>
##  1 1     NA      56     evraq     NA    NA   
##  2 2     false   87     NA        322   322  
##  3 3     true    95     13pxe     492   492  
##  4 4     NA      88     4wmbs     626   626  
##  5 5     true    NA     nrl1w     396   396  
##  6 6     false   NA     4oq74     386   386  
##  7 7     true    NA     gg23f     NA    NA   
##  8 8     false   94     NA        NA    NA   
##  9 9     true    54     hgzxp     519   519  
## 10 10    false   97     NA        371   371  
## 11 11    true    90     NA        768   768  
## 12 12    false   72     vwetc     446   446  
## 13 13    NA      NA     jkhjh     338   338  
## 14 14    false   NA     0swem     778   778  
## 15 15    true    52     qadxc     501   501  
## 16 16    false   75     NA        219   219  
## 17 17    true    71     g42vg     756   756  
## 18 18    false   80     qiejd     712   712  
## 19 19    NA      68     tadad     NA    NA   
## 20 20    NA      53     iyw3o     NA    NA  

My problem

Although the solution I came up with does work, I've learned that relying on the pipe operator is not advised within functions, because it slows down the process (see reference). Also, since %>% is not part of base R, there has to be a way to achieve the same "tweakable wrapping" functionality without the pipe. So I wonder: is there a conventional way to write a wrapper function that could be tweaked to turn off some of its components, and still remain performance-efficient overall?
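For illustration, the most direct pipe-free version I can think of is plain sequential reassignment with if blocks (an untested sketch with a made-up name, relying on the same packages and helper as above):

my_wrangling_wrapper_base <- function(dat,
                                      clean_names       = TRUE,
                                      convert_tf_to_1_0 = TRUE,
                                      convert_to_lower  = TRUE,
                                      pivot_widr        = TRUE,
                                      drp_na            = TRUE) {
  if (clean_names)       dat <- janitor::clean_names(dat)
  if (convert_tf_to_1_0) dat <- mutate(dat, across(everything(), convert_true_false_to_1_0))
  if (convert_to_lower)  dat <- mutate(dat, across(everything(), tolower))
  if (pivot_widr)        dat <- pivot_wider(dat, names_from = condition, values_from = score)
  if (drp_na)            dat <- drop_na(dat)
  dat
}

But I'm not sure whether this is the conventional way to structure such a wrapper.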

(It's worth mentioning that I've asked a similar question about building a wrapper for ggplot, turning geoms off as desired. The answer was great, but not applicable to the current question.)

asked Mar 03 '21 by Emman


3 Answers

Staying with %>%, you could create a functional sequence:

library(magrittr)

my_wrangling_wrapper =
  . %>%
  janitor::clean_names() %>%
  mutate(across(everything(), convert_true_false_to_1_0)) %>%
  mutate(across(everything(), tolower)) %>%
  pivot_wider(names_from = condition, values_from = score) %>%
  drop_na()

As this sequence behaves like a list, you can decide which steps to use by selecting the elements:

clean_names       = TRUE
convert_tf_to_1_0 = TRUE 
convert_to_lower  = FALSE 
pivot_widr        = FALSE
drp_na            = TRUE

my_wrangling_wrapper[c(clean_names,
                       convert_tf_to_1_0,
                       convert_to_lower,
                       pivot_widr,
                       drp_na)]

#Functional sequence with the following components:
#
# 1. janitor::clean_names(.)
# 2. mutate(., across(everything(), convert_true_false_to_1_0))
# 3. drop_na(.)

df %>% my_wrangling_wrapper[c(clean_names,
                               convert_tf_to_1_0,
                               convert_to_lower,
                               pivot_widr,
                               drp_na)]()

#  id is_male weight hash_numb score
#1  1       1     51     Zm1Xx   343
#2  3       1     99     Xc2rm   703
#3  6       0     62     2r2cP   243
#4 12       0     84     llI0f   297
#5 16       0     72     AO76M   475
#6 18       0     63     zGJmW   376

Without %>%, you could use the equivalent freduce solution:

clean_names <- function(x) janitor::clean_names(x)

convert_tf_to_1_0 <- function(x) mutate(x, across(everything(),
                                                  convert_true_false_to_1_0))

convert_to_lower <- function(x) mutate(x, across(everything(), tolower))

pivot_widr <- function(x) pivot_wider(x, names_from = condition,
                                         values_from = score)

drp_na <- function(x) drop_na(x)

my_wrangling_list <- list(clean_names, convert_tf_to_1_0, drp_na)
magrittr::freduce(df, my_wrangling_list) 

Or with %>% and freduce:

df %>% freduce(my_wrangling_list)

I wouldn't be too concerned about the piping overhead; see this answer in the link you referenced: when comparing milliseconds, piping has an impact, but for bigger calculations the piping overhead becomes negligible.
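If you want to quantify the overhead for your own pipeline, a rough comparison could look like this (a sketch using the microbenchmark package; full_list is a hypothetical list mirroring all five steps, and actual timings will depend on your data):

library(microbenchmark)

# hypothetical like-for-like list containing all five steps
full_list <- list(clean_names, convert_tf_to_1_0, convert_to_lower,
                  pivot_widr, drp_na)

microbenchmark(
  piped   = my_wrangling_wrapper(df),
  freduce = magrittr::freduce(df, full_list),
  times   = 100
)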

answered Oct 20 '22 by Waldi


You can use closures to achieve the same as the functional sequence illustrated in the @Waldi answer. Something like:

# we build a wrapper generator providing an arbitrary number of functions to apply
wrapperGenerator <- function(...) {
  flist <- list(...)
  function(data, conf = rep(TRUE, length(flist))) {
    if (!is.logical(conf) || (length(conf) != length(flist)))
      stop("Wrong conf")
    # apply each stored function in turn, skipping those switched off in conf
    for (i in seq_along(flist)) {
      if (conf[[i]])
        data <- flist[[i]](data)
    }
    data
  }
}

# an example for string manipulation
wg <- wrapperGenerator(tolower, function(x) paste0(x, "_suff"), function(x) substring(x, 1, 5))

# some usage
require(stringi)
set.seed(1)
data <- stri_rand_strings(10, 10)
data
#[1] "GNZuCtwed3" "CAgNlUizNm" "vDe7GN0NrL" "TbUBpfn6iP" "oemYWm1Tjg"
#[6] "TrRF46JWfP" "uISKeFTl5s" "LqLKTtrOmx" "QiOKkCi7F8" "E3dsmnSPob"

#Full pipeline
wg(data)
#[1] "gnzuc" "cagnl" "vde7g" "tbubp" "oemyw" "trrf4" "uiske" "lqlkt" "qiokk"
#[10] "e3dsm"

#Just the first two steps
wg(data,c(TRUE,TRUE,FALSE))
# [1] "gnzuctwed3_suff" "cagnluiznm_suff" "vde7gn0nrl_suff" "tbubpfn6ip_suff"
# [5] "oemywm1tjg_suff" "trrf46jwfp_suff" "uiskeftl5s_suff" "lqlkttromx_suff"
# [9] "qiokkci7f8_suff" "e3dsmnspob_suff"

EDIT

Adding a few comments about how the above works. wrapperGenerator is a function that returns a function; it is built just by providing the functions you want to wrap, and no data is needed at this stage. The value returned by wrapperGenerator is itself a function (wg in the example) that you can apply to the actual data. By providing the additional conf argument to this function, you specify which steps are to be performed.
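Wired up to the wrangling steps from the question, this could look like the following sketch (untested; wrangle_wg is just a name I made up, and it assumes the packages and the helper function from the question are loaded):

# build a wrapper from the five verb functions of the question
wrangle_wg <- wrapperGenerator(
  janitor::clean_names,
  function(x) mutate(x, across(everything(), convert_true_false_to_1_0)),
  function(x) mutate(x, across(everything(), tolower)),
  function(x) pivot_wider(x, names_from = condition, values_from = score),
  drop_na
)

# run all five steps
wrangle_wg(df)

# skip the true/false conversion and the NA dropping
wrangle_wg(df, conf = c(TRUE, FALSE, TRUE, TRUE, FALSE))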

Closures are a very powerful tool in R. Here you find a must read on the subject.

answered Oct 20 '22 by nicola


I would (like @nicola) use closures, but with an (IMHO) slightly cleaner interface:

function_factory <- function(...) {
   all_fns <- list(...)
   ## ... arguments must be named
   stopifnot(!is.null(names(all_fns)))
   function(x, ...) {
      selected_fns <- rev(as.character(rlang::ensyms(...)))
      ## if nothing was selected chose everything
      if (!length(selected_fns)) {
         selected_fns <- rev(names(all_fns))
      }
      stopifnot(all(selected_fns %in% names(all_fns)))
      ## function compose operator
      `%.%` <- function(f1, f2) function(...) f1(f2(...))
      fn_seq <- Reduce(`%.%`, all_fns[selected_fns])
      fn_seq(x)
   }
}

## define all potential functions via named(!) arguments to function_factory
fn_f <- function_factory(multiply_by_1000 = function(x) x * 1000,
                         make_negative    = function(x) -abs(x),
                         add_100          = function(x) x + 100)

## example with vector as input
x <- 1:10

## to apply a selected subset simply provide the names of the chunks
fn_f(x, add_100)
fn_f(x, multiply_by_1000, make_negative)
## order matters
fn_f(x, add_100, make_negative)
fn_f(x, make_negative, add_100)

## example with data.frame as input
library(dplyr)
m2 <- mtcars
m2 <- m2 %>% 
   mutate(across(everything(), .fns = function(x) {
      x[sample(length(x), 5)] <- NA
      x
   }))
fn_fd <- function_factory(replace_nas = function(data) mutate(data, across(everything(), .fns = coalesce, -1)),
                          round       = function(data) mutate(data, across(where(is.double), .fns = round, 0)),
                          append_new  = function(data) mutate(data, across(c(vs, am), .fns = paste0, "_new")))
fn_fd(m2, replace_nas, round)
fn_fd(m2, replace_nas, append_new, round)
fn_fd(m2, replace_nas, round, append_new)

## toy example from OP
toy_f <- function_factory(clean_names = clean_names,
                          convert_0_1 = function(x) mutate(x, across(everything(), convert_true_false_to_1_0)),
                          to_lower    = function(x) mutate(x, across(everything(), tolower)),
                          pivot       = function(x) pivot_wider(x, names_from = condition, values_from = score),
                          dropna      = drop_na)

## do all
toy_f(df)
## everything but conversion and dropping
toy_f(df, clean_names, to_lower, pivot)

Explanation

  • function_factory is a closure, i.e. it stores all named(!) arguments in all_fns and returns a function. The idea is that the returned function can access all_fns and do its magic.
  • The returned function also uses ... to see which elements we want to use (the rlang::ensyms part is syntactic sugar: with this approach we can specify the names of the functions without quotes).
  • Then, we compose all selected functions into one function using Reduce, which successively combines the elements of a given vector [1] (this could also be done via purrr::compose); see the short sketch after this list.
  • Eventually we apply the composed function to our data, and voilà.

[1] ?Reduce
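To make the composition step concrete, here is a tiny standalone sketch of what Reduce does with the %.% operator (the toy functions are made up for illustration and are not part of the factory above):

`%.%` <- function(f1, f2) function(...) f1(f2(...))

make_negative <- function(x) -abs(x)
add_100       <- function(x) x + 100

# Reduce folds left to right, so the right-most function runs first:
# Reduce(`%.%`, list(add_100, make_negative)) behaves like function(x) add_100(make_negative(x))
composed <- Reduce(`%.%`, list(add_100, make_negative))
composed(5)
#> [1] 95

This right-to-left application is also why selected_fns is rev()ersed in the factory: it makes the first function you name the first one applied to the data.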

answered Oct 20 '22 by thothal