
GridSearchCV - XGBoost - Early Stopping

I am trying to do a hyperparameter search on XGBoost using scikit-learn's GridSearchCV. During the grid search I'd like it to stop early, since that reduces search time drastically and I expect it to give better results on my prediction/regression task. I am using XGBoost via its scikit-learn API.

    model = xgb.XGBRegressor()
    GridSearchCV(model, paramGrid, verbose=verbose,
                 fit_params={'early_stopping_rounds': 42},
                 cv=TimeSeriesSplit(n_splits=cv).get_n_splits([trainX, trainY]),
                 n_jobs=n_jobs, iid=iid).fit(trainX, trainY)

I tried to pass the early stopping parameters via fit_params, but then it throws the following error, which is basically caused by the lack of a validation set, which early stopping requires:

    /opt/anaconda/anaconda3/lib/python3.5/site-packages/xgboost/callback.py in callback(env=XGBoostCallbackEnv(model=<xgboost.core.Booster o...teration=4000, rank=0, evaluation_result_list=[]))
        187         else:
        188             assert env.cvfolds is not None
        189
        190     def callback(env):
        191         """internal function"""
    --> 192         score = env.evaluation_result_list[-1][1]
            score = undefined
            env.evaluation_result_list = []
        193         if len(state) == 0:
        194             init(env)
        195         best_score = state['best_score']
        196         best_iteration = state['best_iteration']

How can I apply GridSearch on XGBoost using early_stopping_rounds?

Note: the model works without the grid search, and GridSearch also works without fit_params={'early_stopping_rounds':42}.

ayyayyekokojambo asked Mar 24 '17 07:03



2 Answers

When using early_stopping_rounds, you also have to pass eval_metric and eval_set as parameters to the fit method. Early stopping works by calculating the error on an evaluation set. The error has to decrease at least once every early_stopping_rounds rounds; otherwise the generation of additional trees is stopped early.
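To make the stopping rule concrete, here is a small pure-Python sketch of it. It is a simplified illustration and not XGBoost's actual implementation: the function name `run_with_early_stopping` and the list of per-round validation errors are made up for this example.

```python
def run_with_early_stopping(validation_errors, early_stopping_rounds):
    """Walk through per-round validation errors and stop once the error
    has not improved for `early_stopping_rounds` consecutive rounds.

    Returns (best_iteration, best_error, rounds_run).
    """
    best_error = float("inf")
    best_iteration = -1
    for iteration, error in enumerate(validation_errors):
        if error < best_error:
            # New best score: remember it and keep training.
            best_error = error
            best_iteration = iteration
        elif iteration - best_iteration >= early_stopping_rounds:
            # No improvement for too long: stop before the remaining rounds.
            return best_iteration, best_error, iteration + 1
    return best_iteration, best_error, len(validation_errors)


# Hypothetical validation errors, one per boosting round.
errors = [0.9, 0.7, 0.6, 0.65, 0.61, 0.62, 0.5]

stopped_early = run_with_early_stopping(errors, early_stopping_rounds=3)
ran_to_end = run_with_early_stopping(errors, early_stopping_rounds=5)
print(stopped_early, ran_to_end)
```

With a patience of 3 the loop stops before it ever sees the last, better value; with a patience of 5 it runs all rounds. Without an evaluation set there are no such errors to inspect, which is why the callback in the question fails on an empty evaluation_result_list.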

See the documentation of XGBoost's fit method for details.

Here is a minimal, fully working example:

    import xgboost as xgb
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import TimeSeriesSplit

    cv = 2

    trainX = [[1], [2], [3], [4], [5]]
    trainY = [1, 2, 3, 4, 5]

    # these are the evaluation sets
    testX = trainX
    testY = trainY

    paramGrid = {"subsample": [0.5, 0.8]}

    fit_params = {"early_stopping_rounds": 42,
                  "eval_metric": "mae",
                  "eval_set": [[testX, testY]]}

    model = xgb.XGBRegressor()
    gridsearch = GridSearchCV(model, paramGrid, verbose=1,
                              fit_params=fit_params,
                              cv=TimeSeriesSplit(n_splits=cv).get_n_splits([trainX, trainY]))
    gridsearch.fit(trainX, trainY)

glao answered Oct 07 '22 23:10


An update to @glao's answer and a response to @Vasim's comment/question, as of sklearn 0.21.3. Note that fit_params is no longer passed when instantiating GridSearchCV; it is passed to the fit() method instead. Also, the import specifically pulls in the sklearn wrapper module from xgboost:

    import xgboost.sklearn as xgb
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import TimeSeriesSplit

    cv = 2

    trainX = [[1], [2], [3], [4], [5]]
    trainY = [1, 2, 3, 4, 5]

    # these are the evaluation sets
    testX = trainX
    testY = trainY

    paramGrid = {"subsample": [0.5, 0.8]}

    fit_params = {"early_stopping_rounds": 42,
                  "eval_metric": "mae",
                  "eval_set": [[testX, testY]]}

    model = xgb.XGBRegressor()

    gridsearch = GridSearchCV(model, paramGrid, verbose=1,
                              cv=TimeSeriesSplit(n_splits=cv).get_n_splits([trainX, trainY]))

    gridsearch.fit(trainX, trainY, **fit_params)

emigre459 answered Oct 08 '22 00:10
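One detail about the cv argument in the snippets above may be worth checking: get_n_splits() returns only the number of splits, so GridSearchCV receives a plain integer, which for a regressor means ordinary k-fold splitting rather than time-ordered folds. The sketch below, which assumes only that scikit-learn is installed, compares that integer with the index pairs that split() yields; the variable names are illustrative.

```python
from sklearn.model_selection import TimeSeriesSplit

X = [[1], [2], [3], [4], [5]]
splitter = TimeSeriesSplit(n_splits=2)

# get_n_splits() only reports how many folds there would be.
n_splits = splitter.get_n_splits(X)

# split() yields the actual (train_indices, test_indices) pairs,
# where every test fold comes after its training fold in time.
folds = [(train.tolist(), test.tolist()) for train, test in splitter.split(X)]

print(n_splits)
for train_idx, test_idx in folds:
    print(train_idx, test_idx)
```

If time-ordered folds are what you want, passing the splitter object itself (cv=splitter) is the usual way to get them. Keep in mind that on a toy dataset this small each time-ordered test fold holds a single sample, so the default regression score is not meaningful there.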