I recently found out about the folds parameter in xgb.cv, which allows one to specify the indices of the validation set. The helper function xgb.cv.mknfold is then invoked within xgb.cv, and it takes the remaining indices for each fold to be the training indices for that fold.
Question: Can I specify both the training and validation indices via any interface in the xgboost package?
My primary motivation is performing time-series cross-validation, and I do not want the 'non-validation' indices to be automatically assigned as the training data. An example to illustrate what I want to do:
# assume i have 100 strips of time-series data, where each strip is X_i
# validate only on 10 points after training
fold1: train on X_1-X_10, validate on X_11-X_20
fold2: train on X_1-X_20, validate on X_21-X_30
fold3: train on X_1-X_30, validate on X_31-X_40
...
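For concreteness, the index lists for such an expanding-window scheme could be built programmatically, e.g. like this (a sketch, assuming 100 rows ordered in time and a validation window of 10 rows; variable names are illustrative):

```r
# Sketch: expanding-window index lists for the scheme above
# (assumes 100 rows, one row per strip, ordered in time)
n <- 100
window <- 10
folds_train <- list()  # training indices per fold
folds_valid <- list()  # validation indices per fold
k <- 1
for (end_train in seq(window, n - window, by = window)) {
  folds_train[[k]] <- 1:end_train
  folds_valid[[k]] <- (end_train + 1):(end_train + window)
  k <- k + 1
}
# fold 1: train 1:10, validate 11:20
# fold 2: train 1:20, validate 21:30
# ...
```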
Currently, using the folds parameter would force me to use the remaining examples as the validation set, which greatly increases the variance of the error estimate, since the remaining data greatly outnumber the training data and may have a very different distribution from the training data, especially for the earlier folds. Here's what I mean:
fold1: train on X_1-X_10, validate on X_11-X_100 # huge error
...
I'm open to solutions from other packages if they are convenient (i.e. wouldn't require me to pry open source code) and do not nullify the efficiencies of the original xgboost implementation.
xgb.cv is simply a convenience function for performing k-fold cross-validation. If you perform early stopping, it will use the out-of-fold samples to determine the optimal number of boosting iterations.
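For plain k-fold use, a minimal sketch of this behaviour (parameter values are illustrative, using the demo dataset that ships with the package):

```r
library(xgboost)
# Minimal sketch: plain k-fold CV with early stopping; the out-of-fold
# ("test") metric decides when to stop adding boosting rounds.
data(agaricus.train, package = "xgboost")
cv <- xgb.cv(params = list(objective = "binary:logistic", eta = 0.3),
             data = agaricus.train$data, label = agaricus.train$label,
             nrounds = 100, nfold = 5,
             early_stopping_rounds = 5, verbose = FALSE)
cv$best_iteration  # number of boosting rounds chosen by early stopping
```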
I think the bottom part of the question is the wrong way round; it should probably say:
force me to use the remaining examples as the training set
It also seems that the mentioned helper function xgb.cv.mknfold is not around anymore (note: my version of xgboost is 0.71.2). However, it does seem that this could be achieved fairly straightforwardly with a small modification of xgb.cv, e.g. something like:
xgb.cv_new <- function(params = list(), data, nrounds, nfold, label = NULL,
                       missing = NA, prediction = FALSE, showsd = TRUE, metrics = list(),
                       obj = NULL, feval = NULL, stratified = TRUE, folds = NULL, folds_train = NULL,
                       verbose = TRUE, print_every_n = 1L, early_stopping_rounds = NULL,
                       maximize = NULL, callbacks = list(), ...) {
  check.deprecation(...)
  params <- check.booster.params(params, ...)
  for (m in metrics) params <- c(params, list(eval_metric = m))
  check.custom.obj()
  check.custom.eval()
  if ((inherits(data, "xgb.DMatrix") && is.null(getinfo(data, "label"))) ||
      (!inherits(data, "xgb.DMatrix") && is.null(label)))
    stop("Labels must be provided for CV either through xgb.DMatrix, or through 'label=' when 'data' is matrix")
  if (!is.null(folds)) {
    if (!is.list(folds) || length(folds) < 2)
      stop("'folds' must be a list with 2 or more elements that are vectors of indices for each CV-fold")
    nfold <- length(folds)
  } else {
    if (nfold <= 1)
      stop("'nfold' must be > 1")
    folds <- generate.cv.folds(nfold, nrow(data), stratified, label, params)
  }
  params <- c(params, list(silent = 1))
  print_every_n <- max(as.integer(print_every_n), 1L)
  if (!has.callbacks(callbacks, "cb.print.evaluation") && verbose) {
    callbacks <- add.cb(callbacks, cb.print.evaluation(print_every_n, showsd = showsd))
  }
  evaluation_log <- list()
  if (!has.callbacks(callbacks, "cb.evaluation.log")) {
    callbacks <- add.cb(callbacks, cb.evaluation.log())
  }
  stop_condition <- FALSE
  if (!is.null(early_stopping_rounds) && !has.callbacks(callbacks, "cb.early.stop")) {
    callbacks <- add.cb(callbacks, cb.early.stop(early_stopping_rounds,
                                                 maximize = maximize, verbose = verbose))
  }
  if (prediction && !has.callbacks(callbacks, "cb.cv.predict")) {
    callbacks <- add.cb(callbacks, cb.cv.predict(save_models = FALSE))
  }
  cb <- categorize.callbacks(callbacks)
  dall <- xgb.get.DMatrix(data, label, missing)
  bst_folds <- lapply(seq_along(folds), function(k) {
    dtest <- slice(dall, folds[[k]])
    # modification: use 'folds_train' for the training indices when provided,
    # otherwise fall back to the usual "all remaining indices" behaviour
    if (is.null(folds_train))
      dtrain <- slice(dall, unlist(folds[-k]))
    else
      dtrain <- slice(dall, folds_train[[k]])
    handle <- xgb.Booster.handle(params, list(dtrain, dtest))
    list(dtrain = dtrain, bst = handle,
         watchlist = list(train = dtrain, test = dtest), index = folds[[k]])
  })
  rm(dall)
  basket <- list()
  num_class <- max(as.numeric(NVL(params[["num_class"]], 1)), 1)
  num_parallel_tree <- max(as.numeric(NVL(params[["num_parallel_tree"]], 1)), 1)
  begin_iteration <- 1
  end_iteration <- nrounds
  for (iteration in begin_iteration:end_iteration) {
    for (f in cb$pre_iter) f()
    msg <- lapply(bst_folds, function(fd) {
      xgb.iter.update(fd$bst, fd$dtrain, iteration - 1, obj)
      xgb.iter.eval(fd$bst, fd$watchlist, iteration - 1, feval)
    })
    msg <- simplify2array(msg)
    bst_evaluation <- rowMeans(msg)
    bst_evaluation_err <- sqrt(rowMeans(msg^2) - bst_evaluation^2)
    for (f in cb$post_iter) f()
    if (stop_condition)
      break
  }
  for (f in cb$finalize) f(finalize = TRUE)
  ret <- list(call = match.call(), params = params, callbacks = callbacks,
              evaluation_log = evaluation_log, niter = end_iteration,
              nfeatures = ncol(data), folds = folds)
  ret <- c(ret, basket)
  class(ret) <- "xgb.cv.synchronous"
  invisible(ret)
}
I have just added an optional argument folds_train = NULL and used it later on inside the function in this way (see above):
if (is.null(folds_train))
  dtrain <- slice(dall, unlist(folds[-k]))
else
  dtrain <- slice(dall, folds_train[[k]])
Then you can use the new version of the function, e.g. like below:
# save the original version
orig <- xgboost::xgb.cv
# devtools::install_github("miraisolutions/godmode")
godmode:::assignAnywhere("xgb.cv", xgb.cv_new)
# now you can call xgb.cv with the additional argument
# once you are done, you may want to switch back to the original version
# (if you restart R you will also be back to the original version):
godmode:::assignAnywhere("xgb.cv", orig)
So now you should be able to call the function with the extra argument, providing the additional indices for the training data.
Note that I have not had time to test this.
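For example, with the expanding-window scheme from the question, the call might look like this (a hypothetical, untested sketch in the spirit of the answer above; X and y are placeholders for your feature matrix and label vector, and it assumes the patched xgb.cv is in place):

```r
# Hypothetical sketch (untested): expanding-window CV with the patched xgb.cv.
# 'X' and 'y' are placeholders for your feature matrix and label vector.
res <- xgb.cv(params = list(objective = "reg:linear", eta = 0.1),
              data = X, label = y, nrounds = 50,
              folds       = list(11:20, 21:30, 31:40),  # validation indices
              folds_train = list(1:10,  1:20,  1:30))   # training indices
res$evaluation_log
```

Because folds is supplied, nfold is taken from length(folds), and each fold trains only on its folds_train indices rather than on all remaining rows.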