
Add missing indicator columns using the tidymodels recipes package

I'd like to create a recipe using the recipes package that both imputes missing data and adds indicator columns showing which values were missing. It would also be nice to have an option to choose between adding an indicator column for every column in the original data frame and adding indicator columns only for the columns that had missing data in the original data frame. I know I can easily impute missing values with recipes, but is there a built-in way to add missing indicator columns?

For example, if I had a data frame like this:

> data.frame(x = c(1, NA, 3), y = 4:6)
   x y
1  1 4
2 NA 5
3  3 6

I would expect that the output after imputation and adding a missing indicator column would look something like this:

   x y x_missing
1  1 4     FALSE
2  2 5      TRUE
3  3 6     FALSE

Of course, for a simple example like that, I could do it by hand. But when working with a large data set in a machine learning pipeline, it would be helpful to have an automated way to do it.

According to the docs for recipes::check_missing, there is a columns argument,

columns: A character string of variable names that will be populated (eventually) by the terms argument.

but I'm not sure what that means, since there is no terms argument to check_missing.

For reference, the functionality I'm looking for is implemented in scikit-learn by the MissingIndicator class.

Cameron Bieganek asked Jan 27 '20 21:01


1 Answer

It's possible to do this by creating a custom step. Following the process as described in one of the vignettes, create functions defining the step, then define prep and bake methods for the custom step.

The following code defines a new step for creating a missing value indicator. For each selected column, a new logical column is added whose name is the original column name with the suffix _missing appended.

step_missing_ind <- function(recipe, 
                             ...,
                             role = NA, 
                             trained = FALSE,
                             columns = NULL,
                             skip = FALSE,
                             id = rand_id("missing_ind")) {
  # Capture the column selectors passed via ... as quosures.
  terms <- rlang::enquos(...)
  add_step(
    recipe,
    step_missing_ind_new(
      terms = terms, 
      trained = trained,
      role = role, 
      columns = columns,
      skip = skip,
      id = id
    )
  )
}

step_missing_ind_new <- function(terms, 
                                 role, 
                                 trained, 
                                 columns, 
                                 skip, 
                                 id) {
  step(
    subclass = "missing_ind",
    terms = terms,
    role = role,
    trained = trained,
    columns = columns,
    skip = skip,
    id = id
  )
}

print.step_missing_ind <- function(x, width = max(20, options()$width), ...) {
  cat("Missing indicator on ")
  cat(format_selectors(x$terms, width = width))
  if (x$trained) cat(" [trained]\n") else cat("\n")
  invisible(x)
}

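prep.step_missing_ind <- function(x, training, info = NULL, ...) {
  # Resolve the selectors to column names using the training data.
  # recipes_eval_select() replaces terms_select(), which is deprecated
  # in newer recipes releases.
  col_names <- recipes_eval_select(x$terms, training, info)
  step_missing_ind_new(
    terms = x$terms,
    trained = TRUE,
    role = x$role,
    columns = col_names,
    skip = x$skip,
    id = x$id
  )
}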

bake.step_missing_ind <- function(object, new_data, ...) {
  for (var in object$columns) {
    new_data[[paste0(var, "_missing")]] <- is.na(new_data[[var]])
  }
  as_tibble(new_data)
}
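
The question also asked for an option to add indicators only for columns that actually had missing values. One way this could be sketched is a small base-R helper that filters the selected columns against the training data. Note that cols_with_na is a hypothetical name, not part of recipes; the idea is that it could be applied to col_names inside prep.step_missing_ind before the names are stored in columns.

```r
# Hypothetical helper (not part of recipes): keep only the selected
# columns that contain at least one NA in the training data.
cols_with_na <- function(training, col_names) {
  has_na <- vapply(col_names, function(nm) anyNA(training[[nm]]), logical(1))
  col_names[has_na]
}

# Small illustration on made-up data.
training <- data.frame(x = c(1, NA, 3), y = 4:6, z = c(7, 8, NA))
cols_with_na(training, c("x", "y", "z"))
```

Inside prep, this would amount to something like col_names <- cols_with_na(training, col_names). With that change, a column with no missing values at training time gets no indicator even if NAs show up later in new data, so whether it is desirable depends on the pipeline.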

We can then use this missing indicator step in a recipe pipeline as in the following example, where we add a missing value indicator and perform mean imputation. The ordering of the two steps is important: the missing indicator step must come before the imputation step, because once the values have been imputed there are no NAs left to flag.

library(recipes)

data <- tribble(
  ~x, ~y, ~z,
  1, 4, 7,
  NA, 5, 8,
  3, 6, NA
)

recipe(~ ., data = data) %>%
  step_missing_ind(x, y, z) %>%   # must come before imputation
  step_impute_mean(x, y, z) %>%   # called step_meanimpute() before recipes 0.1.16
  prep() %>%
  juice()

#> # A tibble: 3 x 6
#>       x     y     z x_missing y_missing z_missing
#>   <dbl> <dbl> <dbl> <lgl>     <lgl>     <lgl>    
#> 1     1     4   7   FALSE     FALSE     FALSE    
#> 2     2     5   8   TRUE      FALSE     FALSE    
#> 3     3     6   7.5 FALSE     FALSE     TRUE
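
To see why the ordering matters, here is a rough base-R sketch of the same two operations done by hand, without recipes. The helper names add_missing_ind and impute_mean are made up for illustration and only mimic what the two steps do.

```r
# Hypothetical helpers mimicking the two steps in base R.
add_missing_ind <- function(df, cols) {
  for (col in cols) {
    df[[paste0(col, "_missing")]] <- is.na(df[[col]])
  }
  df
}

impute_mean <- function(df, cols) {
  for (col in cols) {
    # Replace NAs with the mean of the observed values in that column.
    df[[col]][is.na(df[[col]])] <- mean(df[[col]], na.rm = TRUE)
  }
  df
}

df <- data.frame(x = c(1, NA, 3), y = c(4, 5, 6), z = c(7, 8, NA))
cols <- c("x", "y", "z")

# Indicator first, then imputation: the NA positions are recorded.
right <- impute_mean(add_missing_ind(df, cols), cols)
right$x_missing
#> [1] FALSE  TRUE FALSE

# Imputation first: nothing is missing any more, so every indicator is FALSE.
wrong <- add_missing_ind(impute_mean(df, cols), cols)
wrong$x_missing
#> [1] FALSE FALSE FALSE
```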
mvh answered Oct 14 '22 09:10