The Python API doesn't give much more information other than that the `seed` parameter is passed to `numpy.random.seed`:

> seed (int) – Seed used to generate the folds (passed to numpy.random.seed).

But what features of `xgboost` use `numpy.random.seed`?

- `xgboost` with all default settings still produces the same performance even when altering the seed.
- `colsample_bytree` does use the seed; different seeds yield different performance.
- Presumably `subsample` and the other `colsample_*` features do as well, which seems plausible since any form of sampling requires randomness.

What other features of `xgboost` rely on `numpy.random.seed`?
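A minimal sketch of the kind of experiment described above: train the same model under different seeds and compare scores. The dataset, metric, and hyperparameters are illustrative choices, not from the original post.

```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification

# Synthetic binary-classification data (an illustrative choice).
X, y = make_classification(n_samples=500, random_state=0)
dtrain = xgb.DMatrix(X, label=y)
base = {"objective": "binary:logistic", "verbosity": 0}

def train_accuracy(extra_params, seed):
    """Train a small booster with the given seed and return training accuracy."""
    params = {**base, **extra_params, "seed": seed}
    booster = xgb.train(params, dtrain, num_boost_round=20)
    preds = booster.predict(dtrain)
    return float(np.mean((preds > 0.5) == y))

# All-default tree growth: the seed has nothing to randomize, so scores match.
print([train_accuracy({}, s) for s in (1, 2, 3)])

# With column subsampling, different seeds can yield different scores.
print([train_accuracy({"colsample_bytree": 0.5}, s) for s in (1, 2, 3)])
```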
Boosted trees are grown sequentially, with tree growth within one iteration distributed among threads. To avoid overfitting, randomness is induced through the following parameters:

- `colsample_bytree`
- `colsample_bylevel`
- `colsample_bynode`
- `subsample` (note the `*sample*` pattern)
- `shuffle` in CV fold creation for cross-validation (see the sketch below)

In addition, you may encounter non-determinism, not controlled by the random state, in the following places:
- [GPU] Histogram building is not deterministic due to the non-associative aspect of floating-point summation.
- Using the `gblinear` booster with the `shotgun` updater is nondeterministic, as it uses the Hogwild algorithm.
- When using the GPU ranking objective, the result is not deterministic due to the non-associative aspect of floating-point summation.
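For the cross-validation case, the doc quote in the question refers to `xgboost.cv`, which shuffles rows into folds using the seed, so changing the seed changes the fold assignment and hence the reported metrics. A minimal sketch; the dataset and metric are illustrative assumptions:

```python
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=0)
dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic"}

# Different seeds shuffle rows into different folds, so the reported
# metrics change even though the model parameters are identical.
for seed in (1, 2):
    res = xgb.cv(params, dtrain, num_boost_round=10, nfold=5,
                 shuffle=True, seed=seed, metrics="logloss")
    print(seed, res["test-logloss-mean"].iloc[-1])
```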
Comment: Re: how do you know this?

To know this, it helps to:

- Be aware of how trees are grown: Demystify Modern Gradient Boosting Trees (its references may also be helpful).
- Scan the full text of the documentation for the terms of interest: `random`, `sample`, `deterministic`, `determinism`, etc. (see the sketch after this list).
- Lastly (firstly?), know why sampling is needed in the first place; similar cases from counterparts like bagged trees (Random Forests by Leo Breiman) and neural networks (Deep Learning with Python by François Chollet, chapter on overfitting) may also be helpful.
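A hypothetical sketch of the documentation-scanning step; the path assumes a local clone of the xgboost repository, whose docs live under `doc/` as `.rst` files, and is not from the original comment:

```python
import pathlib

# Terms from the comment above.
terms = ("random", "sample", "deterministic", "determinism")
docs = pathlib.Path("xgboost/doc")  # assumed local clone; adjust as needed

# Print every docs line mentioning one of the terms, with its location.
for path in sorted(docs.rglob("*.rst")):
    for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), start=1):
        if any(term in line.lower() for term in terms):
            print(f"{path}:{lineno}: {line.strip()}")
```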