What features of xgboost are affected by seed (random_state)?

The Python API doesn't give much information beyond the fact that the seed= parameter is passed to numpy.random.seed:

seed (int) – Seed used to generate the folds (passed to numpy.random.seed).

But what features of xgboost use numpy.random.seed?

  • Running xgboost with all default settings still produces the same performance even when altering the seed.
  • I have already verified that colsample_bytree uses it: different seeds yield different performance (a sketch of this check follows below).
  • I have been told that subsample and the other colsample_* parameters use it as well, which seems plausible, since any form of sampling requires randomness.

What other features of xgboost rely on numpy.random.seed?
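
A minimal sketch of that check (not from the original post; the dataset and parameter values are illustrative):

```python
# Illustrative check: train the same model twice with different seeds and compare
# predictions, once with default settings and once with column sampling enabled.
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
dtrain = xgb.DMatrix(X, label=y)

def predictions(params, seed):
    booster = xgb.train({**params, "seed": seed}, dtrain, num_boost_round=20)
    return booster.predict(dtrain)

defaults = {"objective": "binary:logistic"}
sampled = {"objective": "binary:logistic", "colsample_bytree": 0.5}

# With defaults there is no sampling, so changing the seed leaves the model unchanged...
print(np.allclose(predictions(defaults, 1), predictions(defaults, 2)))   # True
# ...while with column sampling, different seeds pick different feature subsets.
print(np.allclose(predictions(sampled, 1), predictions(sampled, 2)))     # typically False
```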

asked Oct 15 '22 by jorijnsmit


1 Answer

Boosted trees are grown sequentially; within a single boosting iteration, tree growth is distributed among threads. To avoid overfitting, randomness is introduced through the following parameters:

  • colsample_bytree
  • colsample_bylevel
  • colsample_bynode
  • subsample (note the *sample* pattern)
  • shuffle when creating the folds for cross-validation
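
To make the list concrete, here is a minimal sketch (the dataset and values are illustrative; the parameter names come from the XGBoost documentation) of where the seed acts:

```python
# Illustrative sketch: every sampling parameter below draws from the RNG that the
# seed controls, and xgboost.cv's own seed/shuffle arguments control fold assignment.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.normal(size=200)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "reg:squarederror",
    "subsample": 0.8,          # row sampling per boosting iteration
    "colsample_bytree": 0.8,   # column sampling per tree
    "colsample_bylevel": 0.8,  # column sampling per tree depth level
    "colsample_bynode": 0.8,   # column sampling per split
    "seed": 42,                # fixes all of the sampling above
}

# shuffle/seed here control how rows are shuffled into cross-validation folds.
results = xgb.cv(params, dtrain, num_boost_round=50, nfold=5, shuffle=True, seed=42)
print(results.tail(1))
```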

In addition, you may encounter non-determinism, not controlled by random state, in the following places:

  • [GPU] Histogram building is not deterministic due to the non-associative nature of floating-point summation.
  • Using the gblinear booster with the shotgun updater is non-deterministic, as it uses the Hogwild! algorithm.
  • The GPU ranking objective is not deterministic, again due to the non-associative nature of floating-point summation.
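
As an aside on the gblinear case, a small sketch (parameter names per the XGBoost docs; the data is illustrative) of trading the parallel updater for a reproducible one:

```python
# Illustrative sketch: the default "shotgun" updater for gblinear is a lock-free
# (Hogwild!) parallel solver and may differ slightly between runs, whereas
# "coord_descent" runs sequentially and is reproducible.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=300)
dtrain = xgb.DMatrix(X, label=y)

for updater in ("shotgun", "coord_descent"):
    params = {
        "booster": "gblinear",
        "objective": "reg:squarederror",
        "updater": updater,
        "nthread": 4,
    }
    booster = xgb.train(params, dtrain, num_boost_round=10)
    print(updater, booster.predict(dtrain)[:3])
```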

Comment, re: "How do you know this?"

To know this, it helps:

  1. To be aware of how trees are grown: Demystify Modern Gradient Boosting Trees (its references may also be helpful).

  2. To scan the documentation's full text for the terms of interest: random, sample, deterministic, determinism, etc.

  3. Lastly (or perhaps firstly), it helps to know why sampling is needed at all, drawing on counterparts like bagged trees (Random Forests by Leo Breiman) and neural networks (Deep Learning with Python by François Chollet, the chapter on overfitting).

answered Oct 21 '22 by Sergey Bushmanov