
Grid Search and Early Stopping Using Cross Validation with XGBoost in SciKit-Learn

I am fairly new to scikit-learn and have been trying to hyperparameter-tune XGBoost. My aim is to use grid search with cross-validation to tune the model parameters, and early stopping to control the number of trees and avoid overfitting.

As I am using cross-validation for the grid search, I was hoping to also use cross-validation in the early stopping criterion. The code I have so far looks like this:

import numpy as np
import pandas as pd
from sklearn import model_selection
import xgboost as xgb

# Load training and test data
train = pd.read_csv("train.csv").fillna(value=-999.0)
test = pd.read_csv("test.csv").fillna(value=-999.0)

# Separate the target from the features
y_train = train.price_doc
x_train = train.drop(["id", "timestamp", "price_doc"], axis=1)

# XGBoost - sklearn method
gbm = xgb.XGBRegressor()

xgb_params = {
    'learning_rate': [0.01, 0.1],
    'n_estimators': [2000],
    'max_depth': [3, 5, 7, 9],
    'gamma': [0, 1],
    'subsample': [0.7, 1],
    'colsample_bytree': [0.7, 1]
}

fit_params = {
    'early_stopping_rounds': 30,
    'eval_metric': 'mae',
    'eval_set': [(x_train, y_train)]
}

grid = model_selection.GridSearchCV(gbm, xgb_params, cv=5,
                                    fit_params=fit_params)
grid.fit(x_train, y_train)

The problem I am having is with the 'eval_set' parameter. I understand that it expects the predictor and response variables, but I am not sure how I can use the cross-validation data as the early stopping criterion.

Does anyone know how to overcome this problem? Thanks.

George asked May 09 '17

2 Answers

It does not make much sense to include early stopping in GridSearchCV. Early stopping is used to quickly find the best n_rounds in a train/validation setting; if we do not care about 'quickly', we can simply tune n_rounds as a regular hyperparameter. Assuming GridSearchCV could apply early stopping within each fold, we would end up with N (one per fold) values of n_rounds for each set of hyperparameters. The average of those n_rounds could be used for the final best hyperparameter set, but that may not be a good choice when the per-fold n_rounds differ too much from each other. So including early stopping in GridSearchCV might speed up training, but the result might not be the best.
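
A minimal sketch of that point (illustrative, not from the original post): the loop below runs early stopping separately on each CV fold and collects the per-fold best n_rounds, so the spread between folds shows why averaging them can be unreliable. It assumes an xgboost version whose fit() still accepts eval_metric and early_stopping_rounds, and reuses x_train and y_train from the question.

import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold

best_rounds = []
for train_idx, valid_idx in KFold(n_splits=5).split(x_train):
    model = xgb.XGBRegressor(n_estimators=2000, learning_rate=0.1)
    # Early stopping watches this fold's held-out data only
    model.fit(x_train.iloc[train_idx], y_train.iloc[train_idx],
              eval_set=[(x_train.iloc[valid_idx], y_train.iloc[valid_idx])],
              eval_metric='mae', early_stopping_rounds=30, verbose=False)
    best_rounds.append(model.best_iteration)

print(best_rounds)                # one best n_rounds per fold
print(int(np.mean(best_rounds)))  # the average discussed above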

The method suggested in the accepted answer is more like tuning the n_rounds parameter than early stopping, as its author acknowledges that it "won't save the computational time needed to evaluate all the possible n_rounds though".

Ben2018 answered Oct 22 '22


You could pass your early_stopping_rounds and eval_set as extra fit_params to GridSearchCV, and that would actually work. However, GridSearchCV will not change the fit_params between the different folds, so you would end up using the same eval_set in every fold, which might not be what you mean by CV.

from sklearn.model_selection import GridSearchCV, KFold
import xgboost as xgb

kfold = KFold(n_splits=5)
model = xgb.XGBClassifier()
clf = GridSearchCV(model, parameters,  # 'parameters': your hyperparameter grid
                   fit_params={'early_stopping_rounds': 20,
                               'eval_set': [(X, y)]}, cv=kfold)

After some tweaking, I found that the safest way to integrate early_stopping_rounds with the sklearn API is to implement an early-stopping mechanism yourself. You can do this by running GridSearchCV with n_rounds as a parameter to be tuned: watch the mean validation score for the models as n_rounds increases, and define a custom heuristic for early stopping on top of it, as in the sketch below. It won't save the computational time needed to evaluate all the possible n_rounds though.
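
A minimal sketch of this approach, assuming the question's x_train/y_train and illustrative grid values: put n_estimators (the n_rounds) in the grid and read the mean validation score per candidate from cv_results_, which is where current scikit-learn versions expose it.

import xgboost as xgb
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [100, 250, 500, 1000, 2000],  # the n_rounds being tuned
              'learning_rate': [0.01, 0.1]}
grid = GridSearchCV(xgb.XGBRegressor(), param_grid,
                    cv=5, scoring='neg_mean_absolute_error')
grid.fit(x_train, y_train)

# Watch how the mean CV score evolves with n_estimators; a custom heuristic
# could stop considering larger values once improvement stalls.
for params, score in zip(grid.cv_results_['params'],
                         grid.cv_results_['mean_test_score']):
    print(params, score)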

I also think this is a better approach than using a single hold-out split for this purpose.

00__00__00 answered Oct 22 '22