How is xgboost cover calculated?

Tags: r, xgboost

Could someone explain how the Cover column in the xgboost R package is calculated in the xgb.model.dt.tree function?

In the documentation it says that Cover "is a metric to measure the number of observations affected by the split".

When you run the following code, given in the xgboost documentation for this function, Cover for node 0 of tree 0 is 1628.2500.

library(xgboost)

data(agaricus.train, package = 'xgboost')

# The dataset is a list with two items: a sparse matrix and the labels
# (labels = the outcome column that will be learned).
# Each column of the sparse matrix is a feature in one-hot encoding format.
train <- agaricus.train

bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
               eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")

# agaricus.train$data@Dimnames[[2]] holds the column names of the sparse matrix.
xgb.model.dt.tree(agaricus.train$data@Dimnames[[2]], model = bst)

There are 6513 observations in the train dataset, so can anyone explain why Cover for node 0 of tree 0 is a quarter of this number (1628.25)?

Also, Cover for the root node of tree 1 is 788.852; how is that number calculated?

Any help would be much appreciated. Thanks.

asked Nov 04 '15 by dataShrimp

1 Answer

Cover is defined in xgboost as:

the sum of second order gradient of training data classified to the leaf, if it is square loss, this simply corresponds to the number of instances in that branch. Deeper in the tree a node is, lower this metric will be

https://github.com/dmlc/xgboost/blob/f5659e17d5200bd7471a2e735177a81cb8d3012b/R-package/man/xgb.plot.tree.Rd Not particularly well documented....

In order to calculate the Cover, we need to know the predictions at that point in the tree and the second derivative of the loss function with respect to those predictions.

Luckily for us, the prediction for every data point (all 6513 of them) at node 0-0 in your example is 0.5. That is a global default: the initial prediction at t = 0 is 0.5.

base_score [ default=0.5 ] the initial prediction score of all instances, global bias

http://xgboost.readthedocs.org/en/latest/parameter.html
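For what it's worth, you can make that default explicit when training. A minimal sketch (assumption: base_score is simply passed as an extra parameter to the same xgboost() call used in the question; this just restates the default):

bst_b <- xgboost(data = train$data, label = train$label, max.depth = 2,
                 eta = 1, nthread = 2, nround = 2,
                 objective = "binary:logistic",
                 base_score = 0.5)  # explicit, identical to the default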

The gradient of the binary logistic loss (your objective function) is p - y, where p is your prediction and y is the true label.

Thus, the Hessian (which we need for this) is p*(1-p). Note that the Hessian can be computed without y, the true labels.
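Written as tiny R helpers (illustrative names only, not part of the xgboost API; assuming p is the predicted probability and y the 0/1 label):

logistic_grad <- function(p, y) p - y         # first derivative of the loss
logistic_hess <- function(p)    p * (1 - p)   # second derivative; y drops out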

So, bringing it home:

6513 * (.5) * (1 - .5) = 1628.25
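You can reproduce that number in R; this is just a sketch that reuses train from the question's code:

n  <- nrow(train$data)      # 6513 observations
p0 <- rep(0.5, n)           # every initial prediction is the base_score
sum(p0 * (1 - p0))          # [1] 1628.25, the Cover of node 0 in tree 0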

In the second tree, the predictions at that point are no longer all 0.5, so let's get the predictions after one tree:

p = predict(bst,newdata = train$data, ntree=1)

head(p)
[1] 0.8471184 0.1544077 0.1544077 0.8471184 0.1255700 0.1544077

sum(p*(1-p))  # sum of the Hessians in that node (the root node sees all the data)
[1] 788.8521

Note: for linear (squared error) regression the Hessian is always one, so the Cover simply indicates how many examples are in that leaf.
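A quick sanity check of that claim (assumption: swapping in the squared-error objective reg:linear, everything else unchanged from the question's settings); the root-node Cover should then equal the number of rows:

bst_reg <- xgboost(data = train$data, label = train$label, max.depth = 2,
                   eta = 1, nthread = 2, nround = 2, objective = "reg:linear")
xgb.model.dt.tree(agaricus.train$data@Dimnames[[2]], model = bst_reg)
# Cover for node 0 of tree 0 should now be 6513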

The big takeaway is that Cover is defined by the Hessian of the objective function. There is plenty of information out there on deriving the gradient and Hessian of the binary logistic function.

These slides are helpful in seeing why Hessians are used as weights, and they also explain how xgboost splits differently from standard trees. https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf

answered Sep 17 '22 by T. Scharf