
subsample, colsample_bytree, colsample_bylevel in XGBClassifier() Python 3.x

I've spent a good deal of time trying to find out what "subsample", "colsample_bytree", and "colsample_bylevel" actually do in XGBClassifier(), but I can't find a clear explanation. Can someone please explain briefly what they do?

Thanks!

Pyrowomat Avatar asked Jun 25 '18 11:06





1 Answer

The idea behind "subsample", "colsample_bytree", and "colsample_bylevel" comes from Random Forests. A random forest builds an ensemble of many trees and then combines their outputs (by voting or averaging) when making a prediction.

The "random" part comes from two sources: each tree is trained on a random sample of the training rows (bootstrapping), and each node of each tree considers only a random subset of the attributes when choosing a split.

In other words, for each tree in a random forest you:

  1. Select a random sample from the dataset to train this tree;
  2. For each node of this tree, use a random subset of the features. This reduces overfitting and decorrelates the trees.
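
As a rough illustration of those two steps, here is a minimal sketch using only Python's standard library. It does not train anything; it only shows which rows and which features each tree and node would get to see. The function name `forest_sampling_plan` and all of its parameters are made up for this example and are not part of any library.

```python
import random

def forest_sampling_plan(n_rows, n_features, n_trees, max_features,
                         nodes_per_tree, seed=0):
    """Hypothetical helper: return, per tree, the bootstrapped row indices
    and the feature subset drawn for each of its nodes."""
    rng = random.Random(seed)
    plan = []
    for _ in range(n_trees):
        # Step 1: bootstrap sample, drawn WITH replacement, same size as the data.
        bootstrap_rows = [rng.randrange(n_rows) for _ in range(n_rows)]
        # Step 2: a fresh random feature subset for every node of this tree.
        node_features = [
            sorted(rng.sample(range(n_features), max_features))
            for _ in range(nodes_per_tree)
        ]
        plan.append((bootstrap_rows, node_features))
    return plan

# Example: 3 trees over a dataset with 20 rows and 8 features,
# where each node may look at only 3 of the 8 features.
plan = forest_sampling_plan(n_rows=20, n_features=8, n_trees=3,
                            max_features=3, nodes_per_tree=4)
for rows, node_features in plan:
    print(len(set(rows)), "distinct rows;", "first node sees", node_features[0])
```

Because the rows are drawn with replacement, each tree usually sees fewer distinct rows than the dataset contains, and because every node draws its own features, different trees end up splitting on different attributes.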

Similarly to random forests, XGB is an ensemble of weak models that, when put together, give robust and accurate results. Unlike a random forest, it builds the weak models sequentially (boosting), but when they are decision trees they can be randomized in the same way. In this case:

  • "subsample" is the fraction of the training samples (randomly selected) that will be used to train each tree.
  • "colsample_bytree" is the fraction of features (randomly selected) that will be used to train each tree.
  • "colsample_bylevel" is the fraction of features (randomly selected) that will be considered at each depth level of a tree. A new subset is drawn each time a new level is reached, taken from the features already chosen for that tree.
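
To make the nesting of these three parameters concrete, here is a small simulation using only the standard library. It is a sketch of the idea, not XGBoost's actual implementation: the helper `sample_for_tree` is hypothetical, the rounding rule is an assumption, and the exact sampling mechanism inside XGBoost may differ.

```python
import random

def sample_for_tree(n_rows, n_cols, depth, subsample,
                    colsample_bytree, colsample_bylevel, rng):
    """Hypothetical helper: pick the rows and columns one tree would use."""
    # subsample: rows are drawn once per tree (without replacement here).
    n_keep_rows = max(1, round(subsample * n_rows))
    rows = sorted(rng.sample(range(n_rows), n_keep_rows))

    # colsample_bytree: columns are drawn once per tree.
    n_tree_cols = max(1, round(colsample_bytree * n_cols))
    tree_cols = sorted(rng.sample(range(n_cols), n_tree_cols))

    # colsample_bylevel: columns are drawn again at every depth level,
    # but only from the columns already selected for this tree.
    n_level_cols = max(1, round(colsample_bylevel * len(tree_cols)))
    level_cols = [sorted(rng.sample(tree_cols, n_level_cols))
                  for _ in range(depth)]
    return rows, tree_cols, level_cols

rng = random.Random(0)
rows, tree_cols, level_cols = sample_for_tree(
    n_rows=100, n_cols=10, depth=3,
    subsample=0.8, colsample_bytree=0.5, colsample_bylevel=0.6, rng=rng)

print("rows used by this tree:", len(rows))
print("columns available to this tree:", tree_cols)
for level, cols in enumerate(level_cols, start=1):
    print("level", level, "may split on:", cols)
```

Note that the fractions multiply: with 10 features, colsample_bytree=0.5 and colsample_bylevel=0.6, each level gets to choose among roughly 3 features. In the real classifier these are plain constructor arguments, for example `XGBClassifier(subsample=0.8, colsample_bytree=0.5, colsample_bylevel=0.6)`.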
Álvaro Salgado Avatar answered Sep 28 '22 19:09
