I'd like to create a recipe using the recipes package that both imputes missing data and adds indicator columns showing which values were missing. It would also be nice to have an option to choose between adding an indicator column for every column in the original data frame and adding indicator columns only for columns that had missing data. I know I can easily impute missing values with recipes, but is there a built-in way to add missing indicator columns?
For example, if I had a data frame like this:
> data.frame(x = c(1, NA, 3), y = 4:6)
   x y
1  1 4
2 NA 5
3  3 6
I would expect that the output after imputation and adding a missing indicator column would look something like this:
  x y x_missing
1 1 4     FALSE
2 2 5      TRUE
3 3 6     FALSE
Of course, for a simple example like that, I could do it by hand. But when working with a large data set in a machine learning pipeline, it would be helpful to have an automated way to do it.
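To make the "by hand" version concrete, here is a minimal base R sketch. The helper name add_missing_indicators and its only_with_na argument are hypothetical (they are not part of recipes); the argument switches between flagging every column and flagging only columns that contain missing values.

```r
# Hypothetical helper, not part of recipes: adds a logical <col>_missing
# column for each selected column of a data frame.
add_missing_indicators <- function(df, only_with_na = TRUE) {
  cols <- names(df)
  if (only_with_na) {
    # Keep only the columns that contain at least one NA
    cols <- cols[vapply(df, anyNA, logical(1))]
  }
  for (col in cols) {
    df[[paste0(col, "_missing")]] <- is.na(df[[col]])
  }
  df
}

df <- data.frame(x = c(1, NA, 3), y = 4:6)

# Flag the missing values first, then mean-impute x
out <- add_missing_indicators(df)
out$x[is.na(out$x)] <- mean(out$x, na.rm = TRUE)
out
```

This works for a single data frame, but in a pipeline the set of flagged columns would need to be fixed on the training data and reused on new data, which is what a recipe step would handle.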
According to the docs for recipes::check_missing, there is a columns argument:
columns: A character string of variable names that will be populated (eventually) by the terms argument.
However, I'm not sure what that means, since there is no terms argument to check_missing.
For reference, the functionality I'm looking for is implemented in scikit-learn by the MissingIndicator class.
It's possible to do this by creating a custom step. Following the process described in one of the vignettes, create functions defining the step, then define prep and bake methods for the custom step.
The following code defines a new step for creating a missing value indicator. For each selected column, a new column is added whose name is the original name with the suffix _missing appended.
step_missing_ind <- function(recipe,
                             ...,
                             role = NA,
                             trained = FALSE,
                             columns = NULL,
                             skip = FALSE,
                             id = rand_id("missing_ind")) {
  terms <- ellipse_check(...)
  add_step(
    recipe,
    step_missing_ind_new(
      terms = terms,
      trained = trained,
      role = role,
      columns = columns,
      skip = skip,
      id = id
    )
  )
}

step_missing_ind_new <- function(terms,
                                 role,
                                 trained,
                                 columns,
                                 skip,
                                 id) {
  step(
    subclass = "missing_ind",
    terms = terms,
    role = role,
    trained = trained,
    columns = columns,
    skip = skip,
    id = id
  )
}

print.step_missing_ind <- function(x, width = max(20, options()$width), ...) {
  cat("Missing indicator on ")
  cat(format_selectors(x$terms, width = width))
  if (x$trained) cat(" [trained]\n") else cat("\n")
  invisible(x)
}

prep.step_missing_ind <- function(x, training, info = NULL, ...) {
  col_names <- terms_select(terms = x$terms, info = info)
  step_missing_ind_new(
    terms = x$terms,
    trained = TRUE,
    role = x$role,
    columns = col_names,
    skip = x$skip,
    id = x$id
  )
}

bake.step_missing_ind <- function(object, new_data, ...) {
  for (var in object$columns) {
    new_data[[paste0(var, "_missing")]] <- is.na(new_data[[var]])
  }
  as_tibble(new_data)
}
We can then use this missing indicator step in a recipe pipeline as in the following example, where we add a missing value indicator and perform mean imputation. The ordering of the two steps is important: the missing indicator step must come before the imputation step, because once the values have been imputed there are no NAs left to flag.
library(recipes)
data <- tribble(
  ~x, ~y, ~z,
  1, 4, 7,
  NA, 5, 8,
  3, 6, NA
)
recipe(~ ., data = data) %>%
  step_missing_ind(x, y, z) %>%
  step_meanimpute(x, y, z) %>%
  prep() %>%
  juice()
#> # A tibble: 3 x 6
#>       x     y     z x_missing y_missing z_missing
#>   <dbl> <dbl> <dbl> <lgl>     <lgl>     <lgl>
#> 1     1     4   7   FALSE     FALSE     FALSE
#> 2     2     5   8   TRUE      FALSE     FALSE
#> 3     3     6   7.5 FALSE     FALSE     TRUE
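More recent releases of recipes ship a built-in step_indicate_na(), so a custom step may no longer be needed. The sketch below assumes such a release is installed; it also uses step_impute_mean() and bake(new_data = NULL), which are the newer names for step_meanimpute() and juice(). With the default prefix, the indicator columns should be named na_ind_x, na_ind_y and na_ind_z, and they hold integers rather than logicals.

```r
library(recipes)

data <- tibble::tribble(
  ~x, ~y, ~z,
  1,  4,  7,
  NA, 5,  8,
  3,  6,  NA
)

result <- recipe(~ ., data = data) %>%
  # Built-in missing indicator step; it must still precede imputation
  step_indicate_na(x, y, z) %>%
  step_impute_mean(x, y, z) %>%
  prep() %>%
  bake(new_data = NULL)

result
```

To keep only the indicators for columns that actually had missing values in the training data, one option (an assumption on my part, not something the recipes docs prescribe) is to follow the indicator step with step_zv(starts_with("na_ind_")), which drops indicator columns that are constant.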