I want to perform penalty selection for the LASSO algorithm and predict outcomes using tidymodels. I will use the Boston housing dataset to illustrate the problem.
library(tidymodels)
library(tidyverse)
library(mlbench)
data("BostonHousing")
dt <- BostonHousing
I first split the dataset into train/test subsets.
dt_split <- initial_split(dt)
dt_train <- training(dt_split)
dt_test <- testing(dt_split)
Next, I define the preprocessing using the recipes package:
rec <- recipe(medv ~ ., data = dt_train) %>%
  step_center(all_predictors(), -all_nominal()) %>%
  step_dummy(all_nominal()) %>%
  prep()
Then I initialize the model and the workflow, using the glmnet engine. Setting mixture = 1 means I choose the LASSO penalty, and penalty = tune() means I will later use cross-validation to choose the best penalty parameter lambda.
lasso_mod <- linear_reg(mode = "regression",
                        penalty = tune(),
                        mixture = 1) %>%
  set_engine("glmnet")

wf <- workflow() %>%
  add_model(lasso_mod) %>%
  add_recipe(rec)
Preparing stratified 5-fold cross-validation and the penalty grid:
folds <- rsample::vfold_cv(dt_train, v = 5, strata = medv, breaks = 5)
my_grid <- tibble(penalty = 10^seq(-2, -1, length.out = 10))
Let us run the cross-validation:
my_res <- wf %>%
  tune_grid(resamples = folds,
            grid = my_grid,
            control = control_grid(verbose = FALSE, save_pred = TRUE),
            metrics = metric_set(rmse))
I'm now able to get the best penalty from the grid and update my workflow for this optimal penalty:
best_mod <- my_res %>% select_best(metric = "rmse")
print(best_mod)
final_wf <- finalize_workflow(wf, best_mod)
print(final_wf)
== Workflow ===================================================================================================================
Preprocessor: Recipe
Model: linear_reg()
-- Preprocessor ---------------------------------------------------------------------------------------------------------------
2 Recipe Steps
* step_center()
* step_dummy()
-- Model ----------------------------------------------------------------------------------------------------------------------
Linear Regression Model Specification (regression)
Main Arguments:
penalty = 0.0278255940220712
mixture = 1
Computational engine: glmnet
So far so good. Now I want to apply the workflow to the training data to get my final model:
final_mod <- fit(final_wf, data = dt_train) %>%
  pull_workflow_fit()
Now here is the issue: final_mod$fit is an elnet and glmnet object. It contains the full regularization path on a grid of 75 values of the penalty parameter, so the previous penalty tuning step seems pretty much useless. Accordingly, the prediction step fails:
predict(final_mod, new_data = dt)
returns an error:
Error in cbind2(1, newx) %*% nbeta :
invalid class 'NA' to dup_mMatrix_as_dgeMatrix
Of course I could use glmnet::cv.glmnet to get the best penalty and then use the method predict.cv.glmnet, but I need a generic workflow able to work with multiple machine learning models through the same interface. In the documentation of parsnip::linear_reg there is this note regarding the glmnet engine:
For glmnet models, the full regularization path is always fit regardless of the value given to penalty. Also, there is the option to pass multiple values (or no values) to the penalty argument. When using the predict() method in these cases, the return value depends on the value of penalty. When using predict(), only a single value of the penalty can be used. When predicting on multiple penalties, the multi_predict() function can be used. It returns a tibble with a list column called .pred that contains a tibble with all of the penalty results.
However, I do not understand how I should proceed to get the predictions of the tuned LASSO model using the tidymodels framework. The multi_predict function throws the same error as predict.
You are really close here to having everything working right.
Let's read in the data, split it into training/testing and create resampling folds.
library(tidymodels)
library(tidyverse)
library(mlbench)
data("BostonHousing")
dt <- BostonHousing
dt_split <- initial_split(dt)
dt_train <- training(dt_split)
dt_test <- testing(dt_split)
folds <- vfold_cv(dt_train, v = 5, strata = medv, breaks = 5)
Now let's create a preprocessing recipe. (Notice that you don't need to prep() it if you are using a workflow(); prepping can get slow if your data is big, so you might as well not do it until the workflow() takes care of it for you later.)
rec <- recipe(medv ~ ., data = dt_train) %>%
  step_center(all_predictors(), -all_nominal()) %>%
  step_dummy(all_nominal())
Now let's make our model, put it together with our recipe in a workflow(), and tune the workflow with a grid.
lasso_mod <- linear_reg(mode = "regression",
                        penalty = tune(),
                        mixture = 1) %>%
  set_engine("glmnet")

wf <- workflow() %>%
  add_model(lasso_mod) %>%
  add_recipe(rec)
my_grid <- tibble(penalty = 10^seq(-2, -1, length.out = 10))
my_res <- wf %>%
  tune_grid(resamples = folds,
            grid = my_grid,
            control = control_grid(verbose = FALSE, save_pred = TRUE),
            metrics = metric_set(rmse))
This is the best penalty we got:
best_mod <- my_res %>% select_best(metric = "rmse")
best_mod
#> # A tibble: 1 x 2
#> penalty .config
#> <dbl> <chr>
#> 1 0.0215 Preprocessor1_Model04
Here is where we do things a little bit differently than what you did. I am going to finalize my workflow with the best penalty and then fit that finalized workflow to the training data. The output here is a fitted workflow. I do not want to pull the underlying model out of this, because the model needs the preprocessing to work correctly; it was trained expecting the preprocessing to happen.
Instead, I can just predict() straight on that trained workflow:
final_fitted <- finalize_workflow(wf, best_mod) %>%
  fit(data = dt_train)
predict(final_fitted, dt_train)
#> # A tibble: 379 x 1
#> .pred
#> <dbl>
#> 1 18.5
#> 2 24.2
#> 3 23.3
#> 4 21.6
#> 5 37.6
#> 6 21.5
#> 7 16.7
#> 8 15.6
#> 9 21.3
#> 10 21.3
#> # … with 369 more rows
predict(final_fitted, dt_test)
#> # A tibble: 127 x 1
#> .pred
#> <dbl>
#> 1 30.2
#> 2 25.1
#> 3 19.6
#> 4 17.0
#> 5 13.9
#> 6 15.4
#> 7 13.7
#> 8 20.8
#> 9 31.1
#> 10 21.3
#> # … with 117 more rows
Created on 2021-03-16 by the reprex package (v1.0.0)
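If you want a quick check of held-out performance, one option (not part of the reprex above, just a sketch) is to bind the test-set predictions to the true medv values and compute the RMSE with yardstick:
# bind the predictions to the test set and compute the held-out RMSE
bind_cols(dt_test, predict(final_fitted, dt_test)) %>%
  rmse(truth = medv, estimate = .pred)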
If you tune a workflow, then you generally want to finalize, fit, and predict that workflow. An exception might be if you use a really simple preprocessor in your workflow, like a formula that you can pass to fit() directly; I show an example of that approach here.
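Roughly, that kind of exception could look like the following (a sketch only; the penalty value here is illustrative, not the tuned one):
# formula-only preprocessing: fit the parsnip spec directly, no workflow needed
simple_lasso <- linear_reg(penalty = 0.02, mixture = 1) %>%
  set_engine("glmnet") %>%
  fit(medv ~ ., data = dt_train)

# predict() then works on the raw data frame, because the formula handles the factor
predict(simple_lasso, new_data = dt_test)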
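As a final aside on the multi_predict() error: it comes from the same preprocessing issue as predict(). The pulled glmnet fit expects data that has already gone through the recipe, so something along these lines should work (a rough sketch, not run here), though predicting on the fitted workflow is the simpler route:
# pull the fitted model and the prepped recipe out of the fitted workflow
lasso_fit   <- pull_workflow_fit(final_fitted)
prepped_rec <- pull_workflow_prepped_recipe(final_fitted)

# bake the test set so it matches what the glmnet fit was trained on
baked_test <- bake(prepped_rec, new_data = dt_test, all_predictors())

# predictions across several penalty values, returned in a .pred list column
multi_predict(lasso_fit, new_data = baked_test, penalty = c(0.01, 0.05, 0.1))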