I recently found out about the folds parameter in xgb.cv, which allows one to specify the indices of the validation set. The helper function xgb.cv.mknfold is then invoked within xgb.cv, and it takes the remaining indices for each fold to be the training indices for that fold.
Question: Can I specify both the training and validation indices via any interface in the xgboost package?
My primary motivation is performing time-series cross-validation, and I do not want the 'non-validation' indices to be automatically assigned as the training data. An example to illustrate what I want to do:
# assume i have 100 strips of time-series data, where each strip is X_i
# validate only on 10 points after training
fold1: train on X_1-X_10, validate on X_11-X_20
fold2: train on X_1-X_20, validate on X_21-X_30
fold3: train on X_1-X_30, validate on X_31-X_40
...
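For concreteness, the index lists for such an expanding-window scheme could be built programmatically, e.g. like this (a sketch, assuming 100 rows ordered in time and a validation window of 10 rows; variable names are illustrative):

```r
# Sketch: expanding-window index lists for the scheme above
# (assumes 100 rows, one row per strip, ordered in time)
n <- 100
window <- 10
folds_train <- list()  # training indices per fold
folds_valid <- list()  # validation indices per fold
k <- 1
for (end_train in seq(window, n - window, by = window)) {
  folds_train[[k]] <- 1:end_train
  folds_valid[[k]] <- (end_train + 1):(end_train + window)
  k <- k + 1
}
# fold 1: train 1:10, validate 11:20
# fold 2: train 1:20, validate 21:30
# ...
```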
Currently, using the folds parameter would force me to use the remaining examples as the validation set, which greatly increases the variance of the error estimate, since the remaining data greatly outnumber the training data and may have a very different distribution from the training data, especially for the earlier folds. Here's what I mean:
fold1: train on X_1-X_10, validate on X_11-X_100 # huge error
...
I'm open to solutions from other packages if they are convenient (i.e. wouldn't require me to pry open source code) and do not nullify the efficiencies of the original xgboost implementation.
xgb.cv is simply a convenience function for performing k-fold cross-validation. If you perform early stopping, it will use the out-of-fold samples to determine the optimal number of boosting iterations.
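For plain k-fold use, a minimal sketch of this behaviour (parameter values are illustrative, using the demo dataset that ships with the package):

```r
library(xgboost)
# Minimal sketch: plain k-fold CV with early stopping; the out-of-fold
# ("test") metric decides when to stop adding boosting rounds.
data(agaricus.train, package = "xgboost")
cv <- xgb.cv(params = list(objective = "binary:logistic", eta = 0.3),
             data = agaricus.train$data, label = agaricus.train$label,
             nrounds = 100, nfold = 5,
             early_stopping_rounds = 5, verbose = FALSE)
cv$best_iteration  # number of boosting rounds chosen by early stopping
```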
I think the bottom part of the question is the wrong way round; it should probably say:
force me to use the remaining examples as the training set
It also seems that the mentioned helper function xgb.cv.mknfold is not around anymore (note: my version of xgboost is 0.71.2). However, it does seem that this could be achieved fairly straightforwardly with a small modification of xgb.cv, e.g. something like:
xgb.cv_new <- function(params = list(), data, nrounds, nfold, label = NULL,
                       missing = NA, prediction = FALSE, showsd = TRUE, metrics = list(),
                       obj = NULL, feval = NULL, stratified = TRUE, folds = NULL, folds_train = NULL,
                       verbose = TRUE, print_every_n = 1L, early_stopping_rounds = NULL,
                       maximize = NULL, callbacks = list(), ...) {
  check.deprecation(...)
  params <- check.booster.params(params, ...)
  for (m in metrics) params <- c(params, list(eval_metric = m))
  check.custom.obj()
  check.custom.eval()
  if ((inherits(data, "xgb.DMatrix") && is.null(getinfo(data, "label"))) ||
      (!inherits(data, "xgb.DMatrix") && is.null(label)))
    stop("Labels must be provided for CV either through xgb.DMatrix, or through 'label=' when 'data' is matrix")
  if (!is.null(folds)) {
    if (!is.list(folds) || length(folds) < 2)
      stop("'folds' must be a list with 2 or more elements that are vectors of indices for each CV-fold")
    nfold <- length(folds)
  } else {
    if (nfold <= 1)
      stop("'nfold' must be > 1")
    folds <- generate.cv.folds(nfold, nrow(data), stratified, label, params)
  }
  params <- c(params, list(silent = 1))
  print_every_n <- max(as.integer(print_every_n), 1L)
  if (!has.callbacks(callbacks, "cb.print.evaluation") && verbose) {
    callbacks <- add.cb(callbacks, cb.print.evaluation(print_every_n, showsd = showsd))
  }
  evaluation_log <- list()
  if (!has.callbacks(callbacks, "cb.evaluation.log")) {
    callbacks <- add.cb(callbacks, cb.evaluation.log())
  }
  stop_condition <- FALSE
  if (!is.null(early_stopping_rounds) && !has.callbacks(callbacks, "cb.early.stop")) {
    callbacks <- add.cb(callbacks, cb.early.stop(early_stopping_rounds,
                                                 maximize = maximize, verbose = verbose))
  }
  if (prediction && !has.callbacks(callbacks, "cb.cv.predict")) {
    callbacks <- add.cb(callbacks, cb.cv.predict(save_models = FALSE))
  }
  cb <- categorize.callbacks(callbacks)
  dall <- xgb.get.DMatrix(data, label, missing)
  bst_folds <- lapply(seq_along(folds), function(k) {
    dtest <- slice(dall, folds[[k]])
    # modification: use 'folds_train' for the training indices when provided,
    # otherwise fall back to the usual "all remaining indices" behaviour
    if (is.null(folds_train))
      dtrain <- slice(dall, unlist(folds[-k]))
    else
      dtrain <- slice(dall, folds_train[[k]])
    handle <- xgb.Booster.handle(params, list(dtrain, dtest))
    list(dtrain = dtrain, bst = handle,
         watchlist = list(train = dtrain, test = dtest), index = folds[[k]])
  })
  rm(dall)
  basket <- list()
  num_class <- max(as.numeric(NVL(params[["num_class"]], 1)), 1)
  num_parallel_tree <- max(as.numeric(NVL(params[["num_parallel_tree"]], 1)), 1)
  begin_iteration <- 1
  end_iteration <- nrounds
  for (iteration in begin_iteration:end_iteration) {
    for (f in cb$pre_iter) f()
    msg <- lapply(bst_folds, function(fd) {
      xgb.iter.update(fd$bst, fd$dtrain, iteration - 1, obj)
      xgb.iter.eval(fd$bst, fd$watchlist, iteration - 1, feval)
    })
    msg <- simplify2array(msg)
    bst_evaluation <- rowMeans(msg)
    bst_evaluation_err <- sqrt(rowMeans(msg^2) - bst_evaluation^2)
    for (f in cb$post_iter) f()
    if (stop_condition)
      break
  }
  for (f in cb$finalize) f(finalize = TRUE)
  ret <- list(call = match.call(), params = params, callbacks = callbacks,
              evaluation_log = evaluation_log, niter = end_iteration,
              nfeatures = ncol(data), folds = folds)
  ret <- c(ret, basket)
  class(ret) <- "xgb.cv.synchronous"
  invisible(ret)
}
I have just added an optional argument folds_train = NULL and used it later on inside the function in this way (see above):
if (is.null(folds_train))
  dtrain <- slice(dall, unlist(folds[-k]))
else
  dtrain <- slice(dall, folds_train[[k]])
Then you can use the new version of the function, e.g. like below:
# save the original version
orig <- xgboost::xgb.cv
# devtools::install_github("miraisolutions/godmode")
godmode:::assignAnywhere("xgb.cv", xgb.cv_new)
# now you can call xgb.cv with the additional argument
# once you are done, you may want to switch back to the original version
# (if you restart R you will also be back to the original version):
godmode:::assignAnywhere("xgb.cv", orig)
So now you should be able to call the function with the extra argument, providing the additional indices for the training data.
Note that I have not had time to test this.
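For example, with the expanding-window scheme from the question, the call might look like this (a hypothetical, untested sketch in the spirit of the answer above; X and y are placeholders for your feature matrix and label vector, and it assumes the patched xgb.cv is in place):

```r
# Hypothetical sketch (untested): expanding-window CV with the patched xgb.cv.
# 'X' and 'y' are placeholders for your feature matrix and label vector.
res <- xgb.cv(params = list(objective = "reg:linear", eta = 0.1),
              data = X, label = y, nrounds = 50,
              folds       = list(11:20, 21:30, 31:40),  # validation indices
              folds_train = list(1:10,  1:20,  1:30))   # training indices
res$evaluation_log
```

Because folds is supplied, nfold is taken from length(folds), and each fold trains only on its folds_train indices rather than on all remaining rows.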