I've spent a good deal of time trying to find out what "subsample", "colsample_bytree", and "colsample_bylevel" actually do in XGBClassifier(), but I can't find a clear explanation. Can someone please briefly explain what they do?
Thanks!
colsample_bytree - the random subsample of columns drawn when each new tree is created. colsample_bylevel - the random subsample of columns drawn each time a new level of the tree is reached. For example, in a tree with 3 levels, columns A & B might be chosen on the 1st level, B & C on the 2nd, and so on.
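To make the per-tree versus per-level distinction concrete, here is a minimal sketch in plain Python. It is not XGBoost's implementation: the helper names `sample_columns` and `columns_per_level` are made up for illustration, and the rounding rule is an assumption. It does follow the documented behaviour that the `colsample_by*` parameters are cumulative, so the columns available at a level are drawn from the columns already chosen for that tree.

```python
import random

def sample_columns(columns, fraction, rng):
    """Pick a random subset of columns, keeping at least one (assumed rounding)."""
    k = max(1, int(len(columns) * fraction))
    return sorted(rng.sample(columns, k))

def columns_per_level(columns, colsample_bytree, colsample_bylevel, depth, seed=0):
    """Hypothetical helper: sample once per tree, then again at every level."""
    rng = random.Random(seed)
    # colsample_bytree: drawn once, when the tree is created
    tree_cols = sample_columns(columns, colsample_bytree, rng)
    # colsample_bylevel: drawn again from tree_cols at each new depth level
    level_cols = [sample_columns(tree_cols, colsample_bylevel, rng)
                  for _ in range(depth)]
    return tree_cols, level_cols

all_columns = ["A", "B", "C", "D", "E", "F"]
tree_cols, level_cols = columns_per_level(all_columns, 0.5, 0.67, depth=3)
print("columns for this tree:", tree_cols)
for level, cols in enumerate(level_cols, start=1):
    print("level", level, "may split on:", cols)
```

Each level sees a different pair of columns, but always from the three columns picked for the tree.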
max_depth [default=6] Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit. 0 indicates no limit on depth. Beware that XGBoost aggressively consumes memory when training a deep tree.
The idea behind "subsample", "colsample_bytree", and "colsample_bylevel" comes from Random Forests. There, you build an ensemble of many trees and then combine their outputs when making a prediction.
The "random" part comes from randomly sampling the training samples for each tree (bootstrapping), and from building each tree (actually, each node of each tree) while considering only a random subset of the attributes.
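In other words, for each tree in a random forest you draw a random sample of the training data to fit the tree on, and at each node you pick a random subset of the attributes to consider for the split.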
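The two randomization steps of a random forest can be sketched as follows. This is only an illustration using the standard library: `bootstrap_rows` and `node_features` are hypothetical helper names, and no actual tree is fitted.

```python
import random

def bootstrap_rows(n_rows, rng):
    """Sample row indices with replacement (a bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def node_features(n_features, max_features, rng):
    """Pick the attributes one node is allowed to consider for its split."""
    return sorted(rng.sample(range(n_features), max_features))

rng = random.Random(42)
n_rows, n_features = 10, 6

for tree in range(3):
    rows = bootstrap_rows(n_rows, rng)        # new row sample for every tree
    feats = node_features(n_features, 2, rng)  # new attribute subset for a node
    print(f"tree {tree}: rows={rows} root-node features={feats}")
```

Because rows are drawn with replacement, some indices repeat and others are left out of a given tree's sample.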
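Similarly to random forests, XGB is an ensemble of weak models that, when put together, give robust and accurate results. The weak models can be decision trees, which can be randomized in the same way as in random forests. In this case, subsample is the fraction of training rows sampled for each tree, colsample_bytree is the fraction of columns sampled for each tree, and colsample_bylevel is the fraction of columns sampled at each new depth level.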
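As a rough illustration, the three parameters might be set like this with xgboost's scikit-learn wrapper. The values are arbitrary examples, not recommendations, and `approx_counts` is a hypothetical helper whose rounding is an assumption rather than XGBoost's exact rule. The import is wrapped in `try` so the snippet still runs where xgboost is not installed.

```python
params = {
    "n_estimators": 100,
    "max_depth": 6,
    "subsample": 0.8,          # fraction of rows sampled for each tree
    "colsample_bytree": 0.8,   # fraction of columns sampled for each tree
    "colsample_bylevel": 0.5,  # fraction of the tree's columns sampled per level
}

def approx_counts(n_rows, n_cols, p):
    """Approximate rows per tree, columns per tree, and columns per level."""
    rows_per_tree = int(n_rows * p["subsample"])
    cols_per_tree = max(1, int(n_cols * p["colsample_bytree"]))
    cols_per_level = max(1, int(cols_per_tree * p["colsample_bylevel"]))
    return rows_per_tree, cols_per_tree, cols_per_level

print(approx_counts(1000, 20, params))

try:
    from xgboost import XGBClassifier
    model = XGBClassifier(**params)  # then: model.fit(X_train, y_train)
except ImportError:
    model = None  # xgboost not installed; the parameter dict is still valid
```

With 1000 rows and 20 columns, these settings mean each tree sees roughly 800 rows and 16 columns, and each level within a tree chooses its splits from about 8 of those columns.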