Why is recipes 20x slower than hand-made preprocessing when training a caret model?

To build a stacking model, I trained many base models using different preprocessing on the same dataset. To keep track of how each design matrix is built, I used the recipes package and defined my own steps. But passing a recipe with a custom step to a caret training run turned out to be about 20x slower than applying the same preprocessing by hand and training the caret model on the resulting design matrix. Any idea why, and how to improve this?

I provide a reproducible example below:

# Loading libraries
packs <- c("tidyverse", "caret", "e1071", "wavelets", "recipes")
InstIfNec <- function(pack) {
    if (!do.call(require, as.list(pack))) {
        do.call(install.packages, as.list(pack))
    }
    do.call(require, as.list(pack))
}
lapply(packs, InstIfNec)

# Getting data
data(biomass)
biomass <- select(biomass, -dataset, -sample)

# Defining the custom preprocessing algorithm: a row-wise Haar discrete
# wavelet transform that keeps the level-1 approximation coefficients
HaarTransform <- function(DF1) {
    w <- function(k) {
        s1 <- dwt(k, filter = "haar")
        s1@V[[1]]
    }
    Smt <- as.matrix(DF1)
    Smt <- t(base::apply(Smt, 1, w))
    data.frame(Smt)
}
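
For intuition, here is a toy illustration of what each row goes through (a sketch; n.levels = 1 is passed explicitly, but the level-1 coefficients are the same ones the function above keeps):

# Toy check (a sketch): level-1 Haar scaling coefficients are pairwise
# sums divided by sqrt(2), so 8 input points become 4 coefficients and
# the transform halves the number of predictor columns.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
d <- wavelets::dwt(x, filter = "haar", n.levels = 1)
d@V[[1]]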

# Creating the custom step function
step_Haar_new <- function(terms, role, trained, skip, columns, id) {
    step(subclass = "Haar",  terms = terms, role = role, 
         trained = trained, skip = skip, columns = columns, id = id)
}

step_Haar <- function(recipe, ..., role = "predictor", trained = FALSE,
                      skip = FALSE, columns = NULL, id = rand_id("Haar")) {
    terms <- ellipse_check(...)
    add_step(recipe, step_Haar_new(terms = terms, role = role, trained = trained,
                                   skip = skip, columns = columns, id = id))
}

prep.step_Haar <- function(x, training, info = NULL, ...) {
    col_names <- terms_select(terms = x$terms, info = info)
    step_Haar_new(terms = x$terms, role = x$role, trained = TRUE,
        skip = x$skip, columns = col_names, id = x$id)
}

bake.step_Haar <- function(object, new_data, ...) {
    predictors <- HaarTransform(
        dplyr::select(new_data, dplyr::all_of(object$columns)))
    new_data[, object$columns] <- NULL
    bind_cols(new_data, predictors)
}
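
Before timing anything, a quick standalone check (a sketch using prep/bake directly, outside caret) confirms the custom step runs:

# Standalone sanity check of the custom step, outside caret (a sketch):
rec_check <- recipe(carbon ~ ., data = biomass) %>%
    step_Haar(all_predictors())
head(bake(prep(rec_check, training = biomass), new_data = biomass))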

# Fitting the caret model using the recipe
system.time({
    Haar_recipe <- recipe(carbon ~ ., data = biomass) %>%
        step_Haar(all_predictors())
    set.seed(1)
    fit <- caret::train(Haar_recipe, data = biomass, method = "svmLinear")
})


# Fitting the caret model with hand-made preprocessing
system.time({
    df <- HaarTransform(biomass[, -1])
    set.seed(1)
    fit2 <- caret::train(x = df, y = biomass[, 1], method = "svmLinear")
})

# Comparing results
fit; fit2

# Both approaches give the same result, but the recipes route takes
# ~20 seconds while the hand-made preprocessing takes ~1.5 seconds

Profiling with profvis, it looks like the recipe route does the same job many times (27 times), via repeated calls to try() and eval().
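
For reference, the profiling call was along these lines (a sketch, assuming the profvis package is installed):

# Profiling the recipe-based fit (a sketch):
profvis::profvis({
    set.seed(1)
    caret::train(Haar_recipe, data = biomass, method = "svmLinear")
})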

asked Nov 06 '22 by denisC

1 Answer

train does the preprocessing the right way by re-executing the recipe within each resample. This is needed whenever the preprocessing method estimates some statistic from the data in order to apply it. PCA, imputation, and similar methods should be handled this way; otherwise you get a very optimistic view of performance.
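
This also matches the repeated runs seen in profvis: caret's default resampling is 25 bootstrap resamples, so the recipe is executed once per resample plus again for the final model. Purely for experimentation, a sketch (hypothetical settings, not a real fix) that cuts down the number of resamples:

# Fewer resamples mean fewer recipe executions (experimentation only;
# this weakens the resampling estimate, it is not a real fix):
ctrl <- caret::trainControl(method = "boot", number = 5)
fit_fast <- caret::train(Haar_recipe, data = biomass,
                         method = "svmLinear", trControl = ctrl)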

For some techniques, such as the spatial-sign transformation, there is nothing to estimate, and those could be applied once prior to resampling. Otherwise, the preprocessing should go inside the resampling loop (which is why train does it that way).
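
Since step_Haar estimates nothing from the training data, one workaround here (a sketch; only valid for estimation-free steps like this one) is to execute the recipe once up front and hand caret the baked design matrix through its x/y interface:

# Prep and bake the recipe a single time, then train on the fixed
# design matrix (valid here only because step_Haar estimates nothing):
prepped <- prep(Haar_recipe, training = biomass)
baked <- bake(prepped, new_data = biomass)
fit3 <- caret::train(x = dplyr::select(baked, -carbon),
                     y = baked$carbon, method = "svmLinear")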

answered Nov 15 '22 by topepo