 

XGBoost with GridSearchCV, Scaling, PCA, and Early-Stopping in sklearn Pipeline

I want to combine a XGBoost model with input scaling and feature space reduction by PCA. In addition, the hyperparameters of the model as well as the number of components used in the PCA should be tuned using cross-validation. And to prevent the model from overfitting, early stopping should be added.

For combining the various steps, I decided to use sklearn's Pipeline functionalities.

At the beginning, I had some problems making sure that the PCA is also applied to the validation set, but I think passing it via XGB__eval_set does the trick.

The code actually runs without any errors, but it seems to run forever (at some point the CPU usage of all cores drops to zero, yet the processes keep running for hours; I had to kill the session eventually).

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor   

# Train / Test split
X_train, X_test, y_train, y_test = train_test_split(X_with_features, y, test_size=0.2, random_state=123)

# Train / Validation split
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=123)

# Pipeline
pipe = Pipeline(steps=[("Scale", StandardScaler()),
                       ("PCA", PCA()),
                       ("XGB", XGBRegressor())])

# Hyper-parameter grid (Test only)
grid_param_pipe = {'PCA__n_components': [5],
                   'XGB__n_estimators': [1000],
                   'XGB__max_depth': [3],
                   'XGB__reg_alpha': [0.1],
                   'XGB__reg_lambda': [0.1]}

# Grid object
grid_search_pipe = GridSearchCV(estimator=pipe,
                                param_grid=grid_param_pipe,
                                scoring="neg_mean_squared_error",
                                cv=5,
                                n_jobs=5,
                                verbose=3)

# Run CV
grid_search_pipe.fit(X_train, y_train, XGB__early_stopping_rounds=10, XGB__eval_metric="rmse", XGB__eval_set=[[X_val, y_val]])
asked Jun 12 '18 by winwin


People also ask

Can I use XGBoost in sklearn pipeline?

XGBoost works well with Scikit-Learn, has a similar API, and can in most cases be used just like a Scikit-Learn model - so it's natural to be able to build pipelines with both libraries.
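For illustration, here is a minimal sketch (not from the original post) of XGBRegressor dropped into a scikit-learn Pipeline and scored with cross_val_score, just like any other estimator; the diabetes dataset and the parameter values are only stand-ins:

# Minimal sketch: XGBRegressor used as an ordinary scikit-learn estimator
from sklearn.datasets import load_diabetes
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

X, y = load_diabetes(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()),
                 ("xgb", XGBRegressor(n_estimators=50))])  # illustrative value

# Scored with 5-fold cross-validation like any sklearn model
scores = cross_val_score(pipe, X, y, cv=5, scoring="neg_mean_squared_error")
print(scores.mean())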

What is early stopping in XGBoost?

Early stopping is a technique used to stop training when the loss on the validation dataset starts to increase (in the case of minimizing the loss). That's why, to train a model (any model, not only XGBoost) with early stopping, you need two separate datasets: training data for model fitting, and validation data for loss monitoring and early stopping.
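As a minimal sketch of that idea (not from the original post), the following monitors RMSE on a held-out validation set and stops after 10 rounds without improvement. It uses the fit-time keywords seen elsewhere in this post; newer XGBoost releases expect early_stopping_rounds and eval_metric in the constructor instead:

# Minimal early-stopping sketch with the XGBoost sklearn API
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = XGBRegressor(n_estimators=1000)  # generous cap; early stopping picks the actual number
model.fit(X_train, y_train,
          eval_set=[(X_val, y_val)],      # loss is monitored on this held-out set
          eval_metric="rmse",
          early_stopping_rounds=10,       # stop if no improvement for 10 rounds
          verbose=False)
print(model.best_iteration)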


1 Answer

The problem is that the fit method requires an evaluation set created externally, but such a set cannot be created before the pipeline has transformed the data.

This is a bit hacky, but the idea is to create a thin wrapper around the xgboost regressor/classifier that prepares the evaluation set internally.

from sklearn.base import BaseEstimator
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor, XGBClassifier

class XGBoostWithEarlyStop(BaseEstimator):
    def __init__(self, early_stopping_rounds=5, test_size=0.1,
                 eval_metric='mae', **estimator_params):
        self.early_stopping_rounds = early_stopping_rounds
        self.test_size = test_size
        self.eval_metric = eval_metric
        # self.estimator is set by the subclasses before this __init__ runs
        if self.estimator is not None:
            self.set_params(**estimator_params)

    def set_params(self, **params):
        # delegate hyper-parameters to the wrapped estimator so that
        # GridSearchCV keys like 'xgb__n_estimators' reach XGBoost
        return self.estimator.set_params(**params)

    def get_params(self, **params):
        return self.estimator.get_params()

    def fit(self, X, y):
        # split off the evaluation set here, i.e. after the preceding
        # pipeline steps have already transformed X
        x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=self.test_size)
        self.estimator.fit(x_train, y_train,
                           early_stopping_rounds=self.early_stopping_rounds,
                           eval_metric=self.eval_metric, eval_set=[(x_val, y_val)])
        return self

    def predict(self, X):
        return self.estimator.predict(X)

class XGBoostRegressorWithEarlyStop(XGBoostWithEarlyStop):
    def __init__(self, *args, **kwargs):
        self.estimator = XGBRegressor()
        super(XGBoostRegressorWithEarlyStop, self).__init__(*args, **kwargs)

class XGBoostClassifierWithEarlyStop(XGBoostWithEarlyStop):
    def __init__(self, *args, **kwargs):
        self.estimator = XGBClassifier()
        super(XGBoostClassifierWithEarlyStop, self).__init__(*args, **kwargs)

Below is a test.

from sklearn.datasets import load_diabetes
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV

x, y = load_diabetes(return_X_y=True)
print(x.shape, y.shape)
# (442, 10) (442,)

pipe = Pipeline([
    ('pca', PCA(5)),
    ('xgb', XGBoostRegressorWithEarlyStop())
])

param_grid = {
    'pca__n_components': [3, 5, 7],
    'xgb__n_estimators': [10, 20, 30, 50]
}

grid = GridSearchCV(pipe, param_grid, scoring='neg_mean_absolute_error')
grid.fit(x, y)
print(grid.best_params_)
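For completeness, the fitted search can then be inspected through the usual GridSearchCV attributes (a usage sketch, not part of the original answer):

# Standard GridSearchCV attributes for inspecting the result
print(grid.best_score_)              # best mean CV score (negative MAE here)
best_model = grid.best_estimator_    # pipeline refit on the full data
preds = best_model.predict(x)        # use it like any fitted pipeline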

If this were to be raised as a feature request with the developers, the easiest extension would be to allow XGBRegressor to create the evaluation set internally when none is provided. That way, no extension to scikit-learn would be necessary (I guess).

answered Sep 17 '22 by Kota Mori