 

Xgboost: what is the difference among bst.best_score, bst.best_iteration and bst.best_ntree_limit?

When I use xgboost to train my data for a two-class classification problem, I'd like to use early stopping to get the best model, but I'm confused about which attribute to use in my predict call, since early stopping produces three different ones. For example, should I use

preds = model.predict(xgtest, ntree_limit=bst.best_iteration)

or should I use

preds = model.predict(xgtest, ntree_limit=bst.best_ntree_limit)

or are both correct, just meant for different circumstances? If so, how do I judge which one to use?

Here is the relevant quotation from the xgboost documentation, but it doesn't explain the difference, and I couldn't find a comparison of those parameters:

Early Stopping

If you have a validation set, you can use early stopping to find the optimal number of boosting rounds. Early stopping requires at least one set in evals. If there's more than one, it will use the last.

train(..., evals=evals, early_stopping_rounds=10)

The model will train until the validation score stops improving. Validation error needs to decrease at least every early_stopping_rounds to continue training.

If early stopping occurs, the model will have three additional fields: bst.best_score, bst.best_iteration and bst.best_ntree_limit. Note that train() will return a model from the last iteration, not the best one.

Prediction

A model that has been trained or loaded can perform predictions on data sets.

# 7 entities, each contains 10 features 
data = np.random.rand(7, 10) 
dtest = xgb.DMatrix(data) 
ypred = bst.predict(dtest)

If early stopping is enabled during training, you can get predictions from the best iteration with bst.best_ntree_limit:

ypred = bst.predict(dtest, ntree_limit=bst.best_ntree_limit)

Thanks in advance.

asked Apr 21 '17 04:04 by LancelotHolmes

People also ask

What is DMatrix in XGBoost?

DMatrix is an internal data structure used by XGBoost, optimized for both memory efficiency and training speed. You can construct a DMatrix from multiple different sources of data.

What is early stopping rounds XGBoost?

Early stopping is a technique used to stop training when the loss on the validation dataset starts to increase (in the case of minimizing the loss). That's why, to train a model (any model, not only XGBoost), you need two separate datasets: training data for model fitting, and validation data for loss monitoring and early stopping.
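The stopping rule described above can be sketched in a framework-agnostic way. The function and the loss values below are hypothetical, purely to illustrate the logic, not taken from xgboost itself:

```python
# Minimal sketch of early stopping: track the best validation loss and stop
# once it has failed to improve for `patience` consecutive rounds.

def early_stopping(val_losses, patience):
    """Return (best_round, best_loss), scanning losses round by round."""
    best_round = 0
    best_loss = float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_round = i
        elif i - best_round >= patience:
            break  # no improvement for `patience` rounds: stop training
    return best_round, best_loss

# Simulated per-round validation losses: improve until round 3, then worsen.
losses = [0.9, 0.7, 0.6, 0.55, 0.56, 0.58, 0.60, 0.61]
print(early_stopping(losses, patience=3))  # -> (3, 0.55)
```

This is the same idea as passing early_stopping_rounds to train(): the booster keeps training until the monitored metric stops improving, and the best round (not the last one) is what the best_* attributes record.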


1 Answer

From my point of view, both parameters refer to the same thing, or at least serve the same goal. But I would rather use:

preds = model.predict(xgtest, ntree_limit=bst.best_iteration)

From the source code, we can see that best_ntree_limit is going to be dropped in favor of best_iteration:

def _get_booster_layer_trees(model: "Booster") -> Tuple[int, int]:
    """Get number of trees added to booster per-iteration.  This function will be removed
    once `best_ntree_limit` is dropped in favor of `best_iteration`.  Returns
    `num_parallel_tree` and `num_groups`.
    """

Additionally, best_ntree_limit has been removed from the EarlyStopping documentation page.

So I think this attribute exists only for backward-compatibility reasons. From this code snippet and the documentation, we can conclude that best_ntree_limit is, or soon will be, deprecated.
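To make the distinction concrete: best_iteration is a 0-based boosting-round index, while best_ntree_limit counts individual trees. Per the _get_booster_layer_trees docstring above, a booster adds num_parallel_tree * num_groups trees per round, so the two are related as in this small sketch (the numeric values are illustrative, not from a real model):

```python
# Relationship between the 0-based best_iteration and the tree count
# best_ntree_limit, following the per-round tree count described in
# _get_booster_layer_trees.

def ntree_limit_from_iteration(best_iteration, num_parallel_tree=1, num_groups=1):
    # best_iteration is 0-based, so the best model spans best_iteration + 1
    # rounds, each of which adds num_parallel_tree * num_groups trees.
    return (best_iteration + 1) * num_parallel_tree * num_groups

# Binary classification with default settings: one tree per round.
print(ntree_limit_from_iteration(9))                  # -> 10
# A 3-class softprob model grows one tree per class each round.
print(ntree_limit_from_iteration(9, num_groups=3))    # -> 30
```

In newer xgboost releases (1.4+), the same intent is expressed with iteration_range, e.g. model.predict(xgtest, iteration_range=(0, bst.best_iteration + 1)), which works in rounds directly and avoids the tree-count conversion altogether.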

answered Nov 15 '22 23:11 by Antoine Dubuis