I've been exploring the xgboost package in R and have gone through several demos and tutorials, but this still confuses me: after using xgb.cv to do cross-validation, how do the optimal parameters get passed to xgb.train? Or should I calculate the ideal parameters (such as nround, max.depth) based on the output of xgb.cv?
    param <- list("objective" = "multi:softprob",
                  "eval_metric" = "mlogloss",
                  "num_class" = 12)
    cv.nround <- 11
    cv.nfold <- 5
    mdcv <- xgb.cv(data = dtrain, params = param, nthread = 6,
                   nfold = cv.nfold, nrounds = cv.nround, verbose = T)
    md <- xgb.train(data = dtrain, params = param, nround = 80,
                    watchlist = list(train = dtrain, test = dtest), nthread = 6)
It looks like you misunderstood xgb.cv: it is not a parameter-searching function. It performs k-fold cross-validation, nothing more.
In your code, it does not change the value of param.
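To make that concrete, here is a minimal base R sketch of what k-fold cross-validation does for one fixed configuration. It uses lm on the built-in mtcars data as a stand-in model (an assumption for illustration, not xgboost itself), and the helper name kfold_rmse is hypothetical. The point is that it returns a score for the configuration you gave it, not a tuned parameter set.

```r
# Hypothetical helper: k-fold CV score for ONE fixed model formula.
kfold_rmse <- function(formula, data, k = 5, seed = 1) {
  set.seed(seed)
  # Assign each row to one of k folds at random.
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  response <- all.vars(formula)[1]
  fold_rmse <- vapply(seq_len(k), function(i) {
    # Fit on all folds except fold i, then score on fold i.
    fit <- lm(formula, data = data[fold != i, , drop = FALSE])
    held_out <- data[fold == i, , drop = FALSE]
    pred <- predict(fit, newdata = held_out)
    sqrt(mean((held_out[[response]] - pred)^2))
  }, numeric(1))
  mean(fold_rmse)  # one number: how well this configuration generalizes
}

score <- kfold_rmse(mpg ~ wt, mtcars, k = 5)
```

Comparing configurations means calling such a function once per candidate and keeping the best score yourself.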
There are several ways to find the best parameters for XGBoost in R. Here are two:
(1) Use the mlr package, http://mlr-org.github.io/mlr-tutorial/release/html/
There is XGBoost + mlr example code in Kaggle's Prudential challenge, but that code is for regression, not classification. As far as I know, the mlr package does not yet have an mlogloss metric, so you must code the mlogloss measure yourself from scratch. Correct me if I'm wrong.
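For reference, a from-scratch multiclass log loss can be sketched in a few lines of base R. The function name mlogloss and its arguments are my own choices for illustration, and wiring it into mlr as a custom measure is not shown here.

```r
# Hypothetical helper: multiclass log loss.
# probs:  n x K matrix of predicted class probabilities (rows sum to 1)
# labels: integer vector of true classes, 0-based as xgboost expects
mlogloss <- function(probs, labels, eps = 1e-15) {
  stopifnot(nrow(probs) == length(labels))
  # Probability assigned to the true class of each row.
  p_true <- probs[cbind(seq_along(labels), labels + 1)]
  # Clip to avoid log(0).
  p_true <- pmin(pmax(p_true, eps), 1 - eps)
  -mean(log(p_true))
}

probs <- rbind(c(0.50, 0.50),
               c(0.25, 0.75))
loss <- mlogloss(probs, labels = c(0, 1))
```

Here the true-class probabilities are 0.50 and 0.75, so the result is the mean of -log(0.50) and -log(0.75).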
(2) Set the parameters manually, run cross-validation, and repeat. For example:

    param <- list(objective = "multi:softprob",
                  eval_metric = "mlogloss",
                  num_class = 12,
                  max_depth = 8,
                  eta = 0.05,
                  gamma = 0.01,
                  subsample = 0.9,
                  colsample_bytree = 0.8,
                  min_child_weight = 4,
                  max_delta_step = 1
                  )
    cv.nround = 1000
    cv.nfold = 5
    mdcv <- xgb.cv(data = dtrain, params = param, nthread = 6,
                   nfold = cv.nfold, nrounds = cv.nround, verbose = T)
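Then find the best (minimum) mlogloss:

    # In older xgboost releases xgb.cv returns a data.table, as used here.
    # In newer releases the same column is mdcv$evaluation_log$test_mlogloss_mean.
    min_logloss = min(mdcv[, test.mlogloss.mean])
    min_logloss_index = which.min(mdcv[, test.mlogloss.mean])

min_logloss is the minimum value of mlogloss, while min_logloss_index is the index (round) at which it occurs.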
You must repeat the process above several times, changing the parameters manually each time (mlr does the repetition for you), until you finally reach the best global minimum min_logloss.
Note: You can do this in a loop of 100 or 200 iterations, setting the parameter values randomly in each iteration. In that case, you must save the best [parameters_list, min_logloss, min_logloss_index] in variables or in a file.
Note: It is better to set the random seed with set.seed() for reproducible results. Different random seeds yield different results, so you must save [parameters_list, min_logloss, min_logloss_index, seednumber] in variables or in a file.
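The bookkeeping described in these notes can be sketched in base R as follows. The helper sample_params, the parameter ranges, and the seed values are assumptions for illustration. The loss columns are left as NA placeholders because in real use they come from the xgb.cv output.

```r
# Hypothetical helper: draw one random parameter set from fixed ranges.
sample_params <- function() {
  list(max_depth = sample(6:10, 1),
       eta = runif(1, 0.01, 0.3),
       subsample = runif(1, 0.6, 0.9))
}

results <- data.frame()
for (iter in 1:3) {
  seed_number <- 1000 + iter   # fixed here so the run is repeatable
  set.seed(seed_number)
  p <- sample_params()
  # Placeholders: in real use these are computed from the xgb.cv result.
  min_logloss <- NA_real_
  min_logloss_index <- NA_integer_
  # One row per iteration: seed, parameters, and the CV outcome.
  results <- rbind(results,
                   data.frame(seed = seed_number, p,
                              min_logloss = min_logloss,
                              min_logloss_index = min_logloss_index))
}

# Persist the log so a long search survives a crashed session.
out_file <- tempfile(fileext = ".csv")
write.csv(results, out_file, row.names = FALSE)

# Re-seeding with a saved seed reproduces the same parameter draw.
set.seed(results$seed[2])
replayed <- sample_params()
```

Because the seed is stored next to the parameters, any row can be replayed later without rerunning the whole search.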
Say that you finally get 3 results in 3 iterations/repeats:

    min_logloss = 2.1457, min_logloss_index = 840
    min_logloss = 2.2293, min_logloss_index = 920
    min_logloss = 1.9745, min_logloss_index = 780

Then you must use the third set of parameters (it has the global minimum min_logloss of 1.9745). Your best index (nrounds) is 780.
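If the results are kept in a data frame, picking the winner is a single which.min call. The sketch below uses the three example results above; the column names are my own.

```r
# The three example results, one row per repeat.
results <- data.frame(
  run = 1:3,
  min_logloss = c(2.1457, 2.2293, 1.9745),
  min_logloss_index = c(840, 920, 780)
)

# Row with the smallest cross-validated mlogloss.
best <- results[which.min(results$min_logloss), ]
best_run <- best$run                    # 3
best_nrounds <- best$min_logloss_index  # 780
```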
Once you have the best parameters, use them in training:

    # best_param is the global best parameter set, with minimum min_logloss
    # best_min_logloss_index is the global minimum logloss index
    nround = 780
    md <- xgb.train(data = dtrain, params = best_param, nrounds = nround, nthread = 6)
I don't think you need watchlist in the training, because you have already done the cross-validation. But if you still want to use watchlist, that is fine.
Even better, you can use early stopping in xgb.cv.

    mdcv <- xgb.cv(data = dtrain, params = param, nthread = 6,
                   nfold = cv.nfold, nrounds = cv.nround, verbose = T,
                   early.stop.round = 8, maximize = FALSE)
With this code, xgb.cv stops when the mlogloss value has not decreased for 8 rounds, which saves time. You must set maximize to FALSE, because you want the minimum mlogloss.
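The snippets in this answer were written against an older xgboost release. As a hedged sketch of the same idea on a newer release, the example below assumes that the early stopping argument is spelled early_stopping_rounds and that xgb.cv returns an object with an evaluation_log table. The data are synthetic random noise, used only to keep the example self-contained.

```r
library(xgboost)

# Synthetic 3-class data (random noise, for illustration only).
set.seed(42)
n <- 300
x <- matrix(rnorm(n * 4), ncol = 4)
y <- sample(0:2, n, replace = TRUE)   # labels must be 0-based
dtrain <- xgb.DMatrix(data = x, label = y)

param <- list(objective = "multi:softprob",
              eval_metric = "mlogloss",
              num_class = 3,
              max_depth = 3,
              eta = 0.3)

# Cross-validate with early stopping after 5 rounds without improvement.
cv <- xgb.cv(params = param, data = dtrain, nrounds = 30, nfold = 3,
             early_stopping_rounds = 5, verbose = FALSE)

# Round with the lowest mean test mlogloss across folds.
best_nrounds <- which.min(cv$evaluation_log$test_mlogloss_mean)

# Train the final model on all data with that number of rounds.
md <- xgb.train(params = param, data = dtrain, nrounds = best_nrounds)
```

Check the argument names against the documentation of the version you have installed, since they have changed between releases.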
Here is example code with a 100-iteration loop and randomly chosen parameters.

    best_param = list()
    best_seednumber = 1234
    best_logloss = Inf
    best_logloss_index = 0

    for (iter in 1:100) {
        # Draw a random parameter set for this iteration
        param <- list(objective = "multi:softprob",
                      eval_metric = "mlogloss",
                      num_class = 12,
                      max_depth = sample(6:10, 1),
                      eta = runif(1, .01, .3),
                      gamma = runif(1, 0.0, 0.2),
                      subsample = runif(1, .6, .9),
                      colsample_bytree = runif(1, .5, .8),
                      min_child_weight = sample(1:40, 1),
                      max_delta_step = sample(1:10, 1)
                      )
        cv.nround = 1000
        cv.nfold = 5
        # Record the seed so the best run can be reproduced later
        seed.number = sample.int(10000, 1)[[1]]
        set.seed(seed.number)
        mdcv <- xgb.cv(data = dtrain, params = param, nthread = 6,
                       nfold = cv.nfold, nrounds = cv.nround,
                       verbose = T, early.stop.round = 8, maximize = FALSE)

        min_logloss = min(mdcv[, test.mlogloss.mean])
        min_logloss_index = which.min(mdcv[, test.mlogloss.mean])

        # Keep this run if it beats the best one seen so far
        if (min_logloss < best_logloss) {
            best_logloss = min_logloss
            best_logloss_index = min_logloss_index
            best_seednumber = seed.number
            best_param = param
        }
    }

    # Train the final model with the best parameters and round count
    nround = best_logloss_index
    set.seed(best_seednumber)
    md <- xgb.train(data = dtrain, params = best_param, nrounds = nround, nthread = 6)
With this code, you run cross-validation 100 times, each time with random parameters. You then get the best parameter set, namely the one from the iteration with the minimum min_logloss.
Increase the value of early.stop.round if you find that it is too small (stopping too early). You also need to change the limits of the random parameter values based on the characteristics of your data.
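To see why a too-small early stopping value can hurt, here is a simplified base R model of the stopping rule. It is my own sketch, not xgboost's implementation, and the loss curve is made up for illustration.

```r
# Hypothetical helper: round at which training stops when the loss has not
# improved on its best value for `patience` consecutive rounds.
stop_round <- function(losses, patience) {
  best <- Inf
  best_round <- 0
  for (i in seq_along(losses)) {
    if (losses[i] < best) {
      best <- losses[i]
      best_round <- i
    } else if (i - best_round >= patience) {
      return(list(stopped_at = i, best_round = best_round, best = best))
    }
  }
  # Ran out of rounds without triggering the stopping rule.
  list(stopped_at = length(losses), best_round = best_round, best = best)
}

# Made-up loss curve with a plateau before a later improvement.
losses <- c(1.00, 0.90, 0.85, 0.86, 0.87, 0.86, 0.80, 0.78, 0.79, 0.80)
short <- stop_round(losses, patience = 2)  # stops on the plateau
long  <- stop_round(losses, patience = 4)  # survives it and finds round 8
```

With patience 2 the search settles for 0.85 at round 3, while patience 4 gets past the plateau and reaches 0.78 at round 8.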
And for 100 or 200 iterations, I think you will want to change verbose to FALSE.
Side note: That is an example of the random method; you can improve on it, e.g. with Bayesian optimization. If you have the Python version of XGBoost, there is a good hyperparameter script, https://github.com/mpearmain/BayesBoost, that searches for the best parameter set using Bayesian optimization.
Edit: I want to add a 3rd manual method, posted by "Davut Polat", a Kaggle master, in the Kaggle forum.
Edit: If you know Python and sklearn, you can also use GridSearchCV along with xgboost.XGBClassifier or xgboost.XGBRegressor.