
Spatial resampling in stacking pipelines

Tags: r, mlr3

I've built a graph that performs stacking with the mlr3 package. The original code, with a reproducible example, can be found here. In summary: in a first step, I tuned the hyperparameters of the level 0 learners; in a final step, I used the predictions from the tuned level 0 learners to obtain the predictions of an ensemble learner (i.e., the averaged predictions of the level 0 learners). For the final step, I used mlr3pipelines::LearnerClassifAvg as follows:

learner_avg <- mlr3pipelines::LearnerClassifAvg$new(id = "classif.avg")
learner_avg$predict_type <- "prob"

My actual data is spatial, so I used the resampling method below to tune the level 0 learners (step 1):

inner_resampling <- mlr3::rsmp("repeated_sptcv_cstf", folds = 10, repeats = 100)

I thought I could use the same resampling method in the final step, but it doesn't work. For example, the call below fails: only "cv" or "insample" are accepted for resampling.method.

po_learner_glmnet <- mlr3pipelines::po("learner_cv", learner = tuned_learner_glmnet, resampling.method = "sptcv_cstf")

I think the mismatch between the resampling methods used at level 0 ("sptcv_cstf") and level 1 ("cv") could be a problem. To be consistent, the cross-validated predictions at level 1 should also be obtained with a spatial resampling method. Is there a solution to this problem?

If needed, here's an example of how I constructed the stacking pipelines:

po_learner_glmnet <- mlr3pipelines::po("learner_cv", learner = tuned_learner_glmnet)
po_learner_rpart <- mlr3pipelines::po("learner_cv", learner = tuned_learner_rpart)
graph_level_0 <- mlr3pipelines::gunion(list(po_learner_glmnet, po_learner_rpart)) %>>%
  mlr3pipelines::po("featureunion")
graph_levels_0_and_1 <- graph_level_0 %>>% learner_avg
learner_graph_levels_0_and_1 <- mlr3::as_learner(graph_levels_0_and_1)
learner_graph_levels_0_and_1$train(task_sp)

UPDATE:

Following be-marc's response, below are the modifications I made to the PipeOpLearnerCV.R source (see the sections marked 'My edits'). After fixing an error related to paradox_info, running the modified class on the reproducible example below gives the following error:

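library(mlr3)
library(mlr3pipelines)      # gunion(), po(), %>>%, mlr_pipeops
library(mlr3learners)       # classif.glmnet
library(mlr3spatiotempcv)   # sptcv_cstf resamplings

data <- data.frame(ID = 1:1742, x = runif(1742, -130.88, -61.12), y = runif(1742, 12.12, 61.38), year = runif(1742, 2005, 2020), presence = rep(0:1, each=871), V1 = runif(1742, -3.66247, 2.95120), V2 = runif(1742, -1.6501, 7.5510))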
data$presence <- as.factor(data$presence)
## summary(data)
task <- mlr3spatial::as_task_classif_st(x = data, target = "presence", positive = "1", coordinate_names = c("x", "y"), crs = "+proj=longlat +datum=WGS84 +no_defs +type=crs")
task$set_col_roles("ID", roles = "space")
task$set_col_roles("year", roles = "time")

source("E:/R_functions/PipeOpLearnerCV_mod.R")
learner_glmnet = mlr3::lrn("classif.glmnet", predict_type = "prob")
test_po <- PipeOpLearnerCV_mod$new(learner = learner_glmnet, param_vals = list(resampling.method = "sptcv_cstf", resampling.folds = 5))
nop = mlr_pipeops$get("nop")
graph = gunion(list(test_po, nop)) %>>% po("featureunion")
## plot(graph)
graph$train(task)

Error in if (stratify) task$target_names else NULL : 
  argument is of length zero
This happened PipeOp classif.glmnet's $train()

Here is the modified function PipeOpLearnerCV.R:

Note: I don't know what %??% means. I also set id = private$.learner$id because the original line raised an error, but I'm not sure that is correct.
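For reference, `%??%` is the null-default operator that mlr3pipelines takes from mlr3misc: `x %??% y` returns `y` when `x` is `NULL` and `x` otherwise. The sketch below defines a local stand-in with that behaviour purely for illustration (the variable names are hypothetical), so it runs in base R without any mlr3 package.

```r
# Local stand-in for illustration only; mlr3misc ships its own `%??%`.
`%??%` <- function(lhs, rhs) if (is.null(lhs)) rhs else lhs

learner_id <- "classif.glmnet"  # hypothetical learner id

# No id supplied: fall back to the learner's id.
id <- NULL
resolved_default <- id %??% learner_id

# Id supplied by the user: keep it.
id <- "my_custom_id"
resolved_custom <- id %??% learner_id

print(resolved_default)
print(resolved_custom)
```

So the original line `id = id %??% private$.learner$id` keeps a user-supplied id and only falls back to the learner's id when none is given. If that line raised an error when the file was sourced, one possible reason (an assumption, not verified here) is that mlr3misc was not attached; `library(mlr3misc)` before sourcing may be enough.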

PipeOpLearnerCV_mod = R6Class("PipeOpLearnerCV_mod",
                          inherit = PipeOpTaskPreproc,
                          public = list(
                            initialize = function(learner, id = NULL, param_vals = list()) {
                              private$.learner = as_learner(learner, clone = TRUE)
                              if (mlr3pipelines:::paradox_info$is_old) {
                                private$.learner$param_set$set_id = ""
                              }
                              ########################################################################
                              ## My edits
                              id = private$.learner$id
                              ## id = id %??% private$.learner$id
                              # FIXME: can be changed when mlr-org/mlr3#470 has an answer
                              ########################################
                              type = private$.learner$task_type
                              task_type = mlr_reflections$task_types[type, mult = "first"]$task
                              
                              ########################################################################
                              ## My edits
                              private$.crossval_param_set = ps(
                                method = p_fct(levels = c("cv", "insample", "sptcv_cstf", "repeated_sptcv_cstf"), tags = c("train", "required")),
                                folds = p_int(lower = 2L, upper = Inf, tags = c("train", "required")),
                                repeats = p_int(lower = 1L, upper = Inf),
                                keep_response = p_lgl(tags = c("train", "required"))
                              )
                              ########################################
                              private$.crossval_param_set$values = list(method = "cv", folds = 3, keep_response = FALSE)
                              if (mlr3pipelines:::paradox_info$is_old) {
                                private$.crossval_param_set$set_id = "resampling"
                              }
                              # Dependencies in paradox have been broken from the start and this is known since at least a year:
                              # https://github.com/mlr-org/paradox/issues/216
                              # The following would make it _impossible_ to set "method" to "insample", because then "folds"
                              # is both _required_ (required tag above) and at the same time must be unset (because of this
                              # dependency). We will opt for the least annoying behaviour here and just not use dependencies
                              # in PipeOp ParamSets.
                              # private$.crossval_param_set$add_dep("folds", "method", CondEqual$new("cv"))  # don't do this.
                              
                              super$initialize(id, alist(resampling = private$.crossval_param_set, private$.learner$param_set), param_vals = param_vals, can_subset_cols = TRUE, task_type = task_type, tags = c("learner", "ensemble"))
                            }
                            
                          ),
                          active = list(
                            learner = function(val) {
                              if (!missing(val)) {
                                if (!identical(val, private$.learner)) {
                                  stop("$learner is read-only.")
                                }
                              }
                              private$.learner
                            },
                            learner_model = function(val) {
                              if (!missing(val)) {
                                if (!identical(val, private$.learner)) {
                                  stop("$learner_model is read-only.")
                                }
                              }
                              if (is.null(self$state) || is_noop(self$state)) {
                                private$.learner
                              } else {
                                multiplicity_recurse(self$state, clone_with_state, learner = private$.learner)
                              }
                            },
                            predict_type = function(val) {
                              if (!missing(val)) {
                                assert_subset(val, names(mlr_reflections$learner_predict_types[[private$.learner$task_type]]))
                                private$.learner$predict_type = val
                              }
                              private$.learner$predict_type
                            }
                          ),
                          private = list(
                            .train_task = function(task) {
                              on.exit({private$.learner$state = NULL})
                              
                              # Train a learner for predicting
                              self$state = private$.learner$train(task)$state
                              pv = private$.crossval_param_set$values
                              
                              # Compute CV Predictions
                              if (pv$method != "insample") {
                                rdesc = mlr_resamplings$get(pv$method)
                                if (pv$method == "cv") rdesc$param_set$values = list(folds = pv$folds)
                                ########################################################################
                                ## My edits
                                if (pv$method == "sptcv_cstf") rdesc$param_set$values = list(folds = pv$folds)
                                if (pv$method == "repeated_sptcv_cstf") rdesc$param_set$values = list(folds = pv$folds, repeats = pv$repeats)
                                ########################################################################
                                rr = resample(task, private$.learner, rdesc)
                                prds = as.data.table(rr$prediction(predict_sets = "test"))
                              } else {
                                prds = as.data.table(private$.learner$predict(task))
                              }
                              
                              private$pred_to_task(prds, task)
                            },
                            
                            .predict_task = function(task) {
                              on.exit({private$.learner$state = NULL})
                              private$.learner$state = self$state
                              prediction = as.data.table(private$.learner$predict(task))
                              private$pred_to_task(prediction, task)
                            },
                            
                            pred_to_task = function(prds, task) {
                              if (!is.null(prds$truth)) prds[, truth := NULL]
                              if (!self$param_set$values$resampling.keep_response && self$learner$predict_type == "prob") {
                                prds[, response := NULL]
                              }
                              renaming = setdiff(colnames(prds), c("row_id", "row_ids"))
                              setnames(prds, renaming, sprintf("%s.%s", self$id, renaming))
                              
                              # This can be simplified for mlr3 >= 0.11.0;
                              # will be always "row_ids"
                              row_id_col = intersect(colnames(prds), c("row_id", "row_ids"))
                              setnames(prds, old = row_id_col, new = task$backend$primary_key)
                              task$select(character(0))$cbind(prds)
                            },
                            .crossval_param_set = NULL,
                            .learner = NULL,
                            .additional_phash_input = function() private$.learner$phash
                          )
)

mlr_pipeops$add("learner_cv", PipeOpLearnerCV_mod, list(R6Class("Learner", public = list(id = "learner_cv", task_type = "classif", param_set = ps()))$new()))
Asked by Marine, Dec 27 '25 14:12



2 Answers

mlr3 team member here. At the moment, unfortunately, only cv and insample are supported. You could try changing the PipeOp: https://github.com/mlr-org/mlr3pipelines/blob/fa84ba0ff5f38722b58ed62d842b31e7c36b905d/R/PipeOpLearnerCV.R#L130. It should actually be no problem to add spatial_cv as a third option.

Answered by be-marc, Dec 31 '25 17:12



Hi (first comment here so apologies if my form is incorrect or if I've commented prematurely),

I used your code for PipeOpLearnerCV_mod with the modification of adding "spcv_coords" as the resampling type, and then wrapped it in the code for pipeline_stacking, using the code below. My stacked learner has seven different base learners, including cv_glmnet.

I tried this using mlr3pipelines 0.5.2 and training completed successfully.

pipeline_stacking_spatial = function(base_learners, super_learner,
  method = "spcv_coords", folds = 3, use_features = TRUE) {
  assert_learners(base_learners)
  assert_learner(super_learner)
  checkmate::assert_choice(method, c("spcv_coords", "cv", "insample"))
  checkmate::assert_flag(use_features)

  base_learners_cv = mlr3misc::map(base_learners, po,
    .obj = "learner_cv", resampling.method = method, resampling.folds = folds
  )

  if (use_features) base_learners_cv = c(base_learners_cv, po("nop"))

  gunion(base_learners_cv, in_place = TRUE) %>>!%
     po("featureunion", id = "featureunion_stacking") %>>!%
     super_learner
}

mlr_graphs$add("stacking_spatial", pipeline_stacking_spatial)

Maybe it's the resampling type that's not playing? Good luck!
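One possible explanation for the `argument is of length zero` error in the question, offered as an assumption rather than something confirmed in this thread: the line `rdesc$param_set$values = list(folds = pv$folds)` replaces the whole list of parameter values, so any default the resampling had set (the error message suggests a `stratify` flag) is dropped and later reads as `NULL`. The base R sketch below mimics that mechanism with plain lists; `defaults` is a hypothetical stand-in for the resampling's values, not the actual object.

```r
# Hypothetical stand-in for the resampling's default parameter values.
defaults <- list(folds = 10L, stratify = FALSE)

# Whole-list assignment, analogous to
#   rdesc$param_set$values = list(folds = pv$folds)
# The `stratify` entry is lost.
overwritten <- list(folds = 5L)

# Element-wise assignment, analogous to
#   rdesc$param_set$values$folds = pv$folds
# Only `folds` changes; `stratify` keeps its default.
updated <- defaults
updated$folds <- 5L

# `overwritten$stratify` is NULL, so the condition has length zero and errors.
err <- tryCatch(
  if (overwritten$stratify) "target" else NULL,
  error = function(e) conditionMessage(e)
)
print(err)

# With the default preserved, the same condition evaluates without error.
strata <- if (updated$stratify) "target" else NULL
```

If this is indeed the cause, changing the assignments in `.train_task` to the element-wise form, for example `rdesc$param_set$values$folds = pv$folds`, may resolve it. It would also be consistent with "spcv_coords" training without trouble, assuming that resampling has no parameter other than `folds` to lose.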

Answered by fidlerdb, Dec 31 '25 18:12



