When I use xgboost to train a model for a two-class (binary) classification problem, I'd like to use early stopping to get the best model, but I'm confused about which value to use in my predict call, since early stopping sets three different attributes.
For example, should I use
preds = bst.predict(xgtest, ntree_limit=bst.best_iteration)
or should I use
preds = bst.predict(xgtest, ntree_limit=bst.best_ntree_limit)
or are both correct, each suited to different circumstances? If so, how can I judge which one to use?
Here is the relevant quotation from the xgboost documentation, but it doesn't give a reason, and I couldn't find any comparison between these parameters:
Early Stopping
If you have a validation set, you can use early stopping to find the optimal number of boosting rounds. Early stopping requires at least one set in evals. If there's more than one, it will use the last.
train(..., evals=evals, early_stopping_rounds=10)
The model will train until the validation score stops improving. Validation error needs to decrease at least every early_stopping_rounds to continue training.
If early stopping occurs, the model will have three additional fields: bst.best_score, bst.best_iteration and bst.best_ntree_limit. Note that train() will return a model from the last iteration, not the best one.
Prediction
A model that has been trained or loaded can perform predictions on data sets.
# 7 entities, each contains 10 features
data = np.random.rand(7, 10)
dtest = xgb.DMatrix(data)
ypred = bst.predict(dtest)
If early stopping is enabled during training, you can get predictions from the best iteration with bst.best_ntree_limit:
ypred = bst.predict(dtest, ntree_limit=bst.best_ntree_limit)
Thanks in advance.
DMatrix is an internal data structure used by XGBoost, optimized for both memory efficiency and training speed. You can construct a DMatrix from multiple different data sources (its data argument accepts an os.PathLike/string path, a NumPy array, and several other formats).
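For instance, a minimal sketch of building a DMatrix from two of those sources (the toy shapes and density here are made up):

import numpy as np
import scipy.sparse
import xgboost as xgb

# From a dense NumPy array: 100 rows, 10 features, binary labels.
dense = xgb.DMatrix(np.random.rand(100, 10), label=np.random.randint(2, size=100))

# From a SciPy CSR sparse matrix.
sparse = xgb.DMatrix(scipy.sparse.random(100, 10, density=0.2, format="csr"))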
Early stopping is a technique used to stop training when the loss on the validation dataset starts to increase (in the case of minimizing the loss). That's why, to train a model (any model, not only Xgboost), you need two separate datasets: training data for model fitting, and validation data for loss monitoring and early stopping.
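As a minimal sketch of that setup (toy data and parameter values are made up for illustration, and it assumes an xgboost 1.x version where best_ntree_limit still exists):

import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Toy binary-classification data.
X = np.random.rand(500, 10)
y = np.random.randint(2, size=500)

# Two separate datasets: one to fit the trees, one to monitor the loss.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)

params = {"objective": "binary:logistic", "eval_metric": "logloss"}

# Training stops once the validation logloss has failed to improve
# for 10 consecutive rounds; the last entry in evals is the one monitored.
bst = xgb.train(
    params,
    dtrain,
    num_boost_round=1000,
    evals=[(dtrain, "train"), (dvalid, "valid")],
    early_stopping_rounds=10,
)

# The three fields described in the documentation quoted above:
print(bst.best_score, bst.best_iteration, bst.best_ntree_limit)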
From my point of view, both parameters refer to the same thing, or at least have the same goal. But I would rather build on best_iteration:
preds = bst.predict(xgtest, ntree_limit=bst.best_iteration + 1)
(best_iteration is a zero-based round index, while ntree_limit counts trees, hence the + 1.)
From the xgboost source code, we can see that best_ntree_limit is going to be dropped in favor of best_iteration:
def _get_booster_layer_trees(model: "Booster") -> Tuple[int, int]:
    """Get number of trees added to booster per-iteration. This function will be removed
    once `best_ntree_limit` is dropped in favor of `best_iteration`. Returns
    `num_parallel_tree` and `num_groups`.
    """
Additionally, best_ntree_limit has been removed from the EarlyStopping documentation page. So I think this attribute exists only for backwards-compatibility reasons. From this code snippet and the documentation, we can therefore assume that best_ntree_limit is, or soon will be, deprecated.
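Finally, if you are on xgboost >= 1.4, ntree_limit itself is deprecated on predict in favor of iteration_range, which is expressed directly in zero-based, half-open iteration indices. A minimal sketch, reusing the bst and xgtest names from above:

# iteration_range is half-open over boosting rounds:
# (0, best_iteration + 1) keeps rounds 0 .. best_iteration inclusive.
preds = bst.predict(xgtest, iteration_range=(0, bst.best_iteration + 1))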