How to specify train and test indices for xgb.cv in the R package XGBoost

I recently found out about the folds parameter in xgb.cv, which allows one to specify the indices of the validation set. The helper function xgb.cv.mknfold is then invoked within xgb.cv, and it takes the remaining indices for each fold to be the training indices for that fold.

Question: Is there any interface in the xgboost package that lets me specify both the training and validation indices?

My primary motivation is time-series cross-validation, where I do not want the 'non-validation' indices to be automatically assigned as the training data. An example to illustrate what I want to do (with a sketch of the corresponding index lists below):

# assume I have 100 strips of time-series data, where strip i is X_i
# validate on the 10 strips immediately after the training window
fold1:  train on X_1-X_10, validate on X_11-X_20
fold2:  train on X_1-X_20, validate on X_21-X_30
fold3:  train on X_1-X_30, validate on X_31-X_40
...
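
For concreteness, a minimal sketch (purely illustrative) of the row-index lists this expanding-window scheme corresponds to, assuming each strip X_i is a single row of the data (100 rows in total):

# fold k trains on the first 10*k rows and validates on the next 10 rows
train_idx <- list(1:10, 1:20, 1:30)
valid_idx <- list(11:20, 21:30, 31:40)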

Currently, using the folds parameter would force me to use the remaining examples as the validation set, which greatly increases the variance of the error estimate, since the remaining data greatly outnumber the training data and may have a very different distribution from the training data, especially for the earlier folds. Here's what I mean:

fold1:  train on X_1-X_10, validate on X_11-X_100 # huge error
...

I'm open to solutions from other packages if they are convenient (i.e. wouldn't require me to dig into the source code) and do not negate the efficiency of the original xgboost implementation.

asked Sep 07 '15 by JP_smasher


1 Answer

I think the bottom part of the question is the wrong way round; it should probably say:

force me to use the remaining examples as the training set

It also seems that the helper function xgb.cv.mknfold mentioned in the question is not around anymore; note that my version of xgboost is 0.71.2.

However, it does seem that this could be achieved fairly straightforwardly with a small modification of xgb.cv, e.g. something like:

xgb.cv_new <- function(params = list(), data, nrounds, nfold, label = NULL, 
                       missing = NA, prediction = FALSE, showsd = TRUE, metrics = list(), 
                       obj = NULL, feval = NULL, stratified = TRUE, folds = NULL, folds_train = NULL, 
                       verbose = TRUE, print_every_n = 1L, early_stopping_rounds = NULL, 
                       maximize = NULL, callbacks = list(), ...) {
  check.deprecation(...)
  params <- check.booster.params(params, ...)
  for (m in metrics) params <- c(params, list(eval_metric = m))
  check.custom.obj()
  check.custom.eval()
  if ((inherits(data, "xgb.DMatrix") && is.null(getinfo(data, "label"))) || 
      (!inherits(data, "xgb.DMatrix") && is.null(label))) 
    stop("Labels must be provided for CV either through xgb.DMatrix, or through 'label=' when 'data' is matrix")
  if (!is.null(folds)) {
    if (!is.list(folds) || length(folds) < 2) 
      stop("'folds' must be a list with 2 or more elements that are vectors of indices for each CV-fold")
    nfold <- length(folds)
  }
  else {
    if (nfold <= 1) 
      stop("'nfold' must be > 1")
    folds <- generate.cv.folds(nfold, nrow(data), stratified, 
                               label, params)
  }
  params <- c(params, list(silent = 1))
  print_every_n <- max(as.integer(print_every_n), 1L)
  if (!has.callbacks(callbacks, "cb.print.evaluation") && verbose) {
    callbacks <- add.cb(callbacks, cb.print.evaluation(print_every_n, 
                                                       showsd = showsd))
  }
  evaluation_log <- list()
  if (!has.callbacks(callbacks, "cb.evaluation.log")) {
    callbacks <- add.cb(callbacks, cb.evaluation.log())
  }
  stop_condition <- FALSE
  if (!is.null(early_stopping_rounds) && !has.callbacks(callbacks, 
                                                        "cb.early.stop")) {
    callbacks <- add.cb(callbacks, cb.early.stop(early_stopping_rounds, 
                                                 maximize = maximize, verbose = verbose))
  }
  if (prediction && !has.callbacks(callbacks, "cb.cv.predict")) {
    callbacks <- add.cb(callbacks, cb.cv.predict(save_models = FALSE))
  }
  cb <- categorize.callbacks(callbacks)
  dall <- xgb.get.DMatrix(data, label, missing)
  bst_folds <- lapply(seq_along(folds), function(k) {
    dtest <- slice(dall, folds[[k]])
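    # NEW: when folds_train is supplied, take the training rows for fold k from
    # folds_train[[k]]; otherwise keep the default of all rows not in folds[[k]]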
    if (is.null(folds_train))
      dtrain <- slice(dall, unlist(folds[-k]))
    else
      dtrain <- slice(dall, folds_train[[k]])
    handle <- xgb.Booster.handle(params, list(dtrain, dtest))
    list(dtrain = dtrain, bst = handle, 
         watchlist = list(train = dtrain, test = dtest), index = folds[[k]])
  })
  rm(dall)
  basket <- list()
  num_class <- max(as.numeric(NVL(params[["num_class"]], 1)), 1)
  num_parallel_tree <- max(as.numeric(NVL(params[["num_parallel_tree"]], 1)), 1)
  begin_iteration <- 1
  end_iteration <- nrounds
  for (iteration in begin_iteration:end_iteration) {
    for (f in cb$pre_iter) f()
    msg <- lapply(bst_folds, function(fd) {
      xgb.iter.update(fd$bst, fd$dtrain, iteration - 1, obj)
      xgb.iter.eval(fd$bst, fd$watchlist, iteration - 1, feval)
    })
    msg <- simplify2array(msg)
    bst_evaluation <- rowMeans(msg)
    bst_evaluation_err <- sqrt(rowMeans(msg^2) - bst_evaluation^2)
    for (f in cb$post_iter) f()
    if (stop_condition) 
      break
  }
  for (f in cb$finalize) f(finalize = TRUE)
  ret <- list(call = match.call(), params = params, callbacks = callbacks, 
              evaluation_log = evaluation_log, niter = end_iteration, 
              nfeatures = ncol(data), folds = folds)
  ret <- c(ret, basket)
  class(ret) <- "xgb.cv.synchronous"
  invisible(ret)
}

I have just added an optional argument folds_train = NULL and used it inside the function in this way (see above):

if (is.null(folds_train))
  dtrain <- slice(dall, unlist(folds[-k]))
else
  dtrain <- slice(dall, folds_train[[k]])

Then you can use the new version of the function, e.g. as below:

# save original version
orig <- xgboost::xgb.cv

# devtools::install_github("miraisolutions/godmode")
godmode:::assignAnywhere("xgb.cv", xgb.cv_new)

# now you can use (call) xgb.cv with the additional argument

# once you are done, you may want to switch back to the original version
# (if you restart R you will also be back to the original version):
godmode:::assignAnywhere("xgb.cv", orig)

So now you should be able to call the function with the extra argument, providing the additional indices for the training data.
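
For illustration, here is a hedged sketch of how such a call could then look with the expanding-window folds from the question (the matrix x, label y and the parameter values are placeholders; folds_train is the new argument added above):

library(xgboost)

# validation indices per fold (the existing 'folds' argument)
folds_valid <- list(11:20, 21:30, 31:40)
# matching training indices per fold (the new 'folds_train' argument)
folds_train <- list(1:10, 1:20, 1:30)

cv <- xgb.cv(params = list(objective = "reg:linear", eta = 0.1),
             data = x, label = y, nrounds = 50,
             folds = folds_valid, folds_train = folds_train)

Each fold should then train only on the indices given in folds_train and evaluate on the corresponding element of folds.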

Note that I have not had time to test this.

answered Sep 28 '22 by RolandASc