The xgboost.XGBRegressor seems to produce the same results even though a new random seed is given.
According to the xgboost documentation for xgboost.XGBRegressor:
seed : int Random number seed. (Deprecated, please use random_state)
random_state : int Random number seed. (replaces seed)
random_state is the one that should be used; however, no matter which random_state or seed I use, the model produces the same results. Is this a bug?
from xgboost import XGBRegressor
from sklearn.datasets import load_boston
import numpy as np
from itertools import product

def xgb_train_predict(random_state=0, seed=None):
    X, y = load_boston(return_X_y=True)
    xgb = XGBRegressor(random_state=random_state, seed=seed)
    xgb.fit(X, y)
    y_ = xgb.predict(X)
    return y_

check = xgb_train_predict()

random_state = [1, 42, 58, 69, 72]
seed = [None, 2, 24, 85, 96]
for r, s in product(random_state, seed):
    y_ = xgb_train_predict(r, s)
    assert np.equal(y_, check).all()
    print('CHECK! \t random_state: {} \t seed: {}'.format(r, s))
[Out]:
CHECK! random_state: 1 seed: None
CHECK! random_state: 1 seed: 2
CHECK! random_state: 1 seed: 24
CHECK! random_state: 1 seed: 85
CHECK! random_state: 1 seed: 96
CHECK! random_state: 42 seed: None
CHECK! random_state: 42 seed: 2
CHECK! random_state: 42 seed: 24
CHECK! random_state: 42 seed: 85
CHECK! random_state: 42 seed: 96
CHECK! random_state: 58 seed: None
CHECK! random_state: 58 seed: 2
CHECK! random_state: 58 seed: 24
CHECK! random_state: 58 seed: 85
CHECK! random_state: 58 seed: 96
CHECK! random_state: 69 seed: None
CHECK! random_state: 69 seed: 2
CHECK! random_state: 69 seed: 24
CHECK! random_state: 69 seed: 85
CHECK! random_state: 69 seed: 96
CHECK! random_state: 72 seed: None
CHECK! random_state: 72 seed: 2
CHECK! random_state: 72 seed: 24
CHECK! random_state: 72 seed: 85
CHECK! random_state: 72 seed: 96
It seems (I didn't know this myself before digging for an answer :) ) that xgboost uses its random generator only for sub-sampling, see Laurae's comment on a similar GitHub issue. Otherwise the behavior is deterministic.
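To make that concrete, here is a minimal sketch (not part of the original post; it reuses the Boston housing data from the question, which has been removed from recent scikit-learn releases, so substitute any regression dataset if needed): once sub-sampling is enabled, different random_state values are expected to give different predictions.

from xgboost import XGBRegressor
from sklearn.datasets import load_boston
import numpy as np

X, y = load_boston(return_X_y=True)

def predict_with_state(random_state):
    # subsample < 1.0 turns on row sub-sampling, the only place the RNG is used
    xgb = XGBRegressor(subsample=0.5, random_state=random_state)
    xgb.fit(X, y)
    return xgb.predict(X)

# With sub-sampling on, different seeds should no longer produce identical fits
print(np.array_equal(predict_with_state(1), predict_with_state(42)))  # expected: False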
If you had used sampling, there is an issue in the seed/random_state handling in xgboost's current sklearn API: seed is indeed claimed to be deprecated, but it seems that if you provide it, it will still be used in preference to random_state, as can be seen here in the code. (This comment is only relevant when seed is not None.)