
XGBoost: cannot pass validation data for eval_set in pipeline

I want to run GridSearchCV over an XGBoost model inside a pipeline. I have a data preprocessor (defined earlier in my code) and a parameter grid.

XGBmodel = XGBRegressor(random_state=0)
pipe = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('XGBmodel', XGBmodel)
])

I want to pass these fit parameters:

fit_params = {"XGBmodel__eval_set": [(X_valid, y_valid)], 
              "XGBmodel__early_stopping_rounds": 10, 
              "XGBmodel__verbose": False}

Then I try to fit the model:

searchCV = GridSearchCV(pipe, cv=5, param_grid=param_grid, fit_params=fit_params)
searchCV.fit(X_train, y_train)

but I get an error on the line with eval_set: DataFrame.dtypes for data must be int, float or bool

I guess this happens because the validation data does not go through the preprocessing. But everything I find when I google does it this way, so it seems it should work. I also looked for a way to apply the preprocessor to the validation data separately, but the preprocessor cannot transform the validation data until it has been fitted on the training data.

Full code

columns = num_cols + cat_cols
X_train = X_full_train[columns].copy()
X_valid = X_full_valid[columns].copy()

num_preprocessor = SimpleImputer(strategy = 'mean')
cat_preprocessor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy = 'most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', num_preprocessor, num_cols),
    ('cat', cat_preprocessor, cat_cols)
])

XGBmodel = XGBRegressor(random_state=0)
pipe = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('XGBmodel', XGBmodel)
])

param_grid = {
    "XGBmodel__n_estimators": [10, 50, 100, 500],
    "XGBmodel__learning_rate": [0.1, 0.5, 1],
}

fit_params = {"XGBmodel__eval_set": [(X_valid, y_valid)], 
              "XGBmodel__early_stopping_rounds": 10, 
              "XGBmodel__verbose": False}

searchCV = GridSearchCV(pipe, cv=5, param_grid=param_grid, fit_params=fit_params)
searchCV.fit(X_train, y_train)

Is there any way to preprocess the validation data inside the pipeline? Or is there a completely different way to implement this?

Надежда Фадеева, asked May 28 '19 16:05


2 Answers

There is no good way. If you have a long chain of transformers ahead of the model, consider fitting them as a pipeline of their own and then fitting the model separately on the transformed data.

The underlying issue is that a pipeline has no notion of a validation set used during model fitting: fit parameters such as eval_set are handed to the final step as they are, without passing through the earlier transformers. There is a discussion of this on the LightGBM GitHub. Their proposal is to pre-fit the transformers and apply them to the validation data before you fit the full pipeline. This is fine if the transformers are fast, but because the transformers are fitted twice it can double CPU time in an extreme scenario.

Mischa Lisovyi, answered Sep 30 '22 13:09


One way to train a pipeline that uses early stopping is to fit the preprocessing steps and the regressor separately.

The steps are the following:

  1. fit_transform() the transformers on the training data.
  2. transform() the validation data.
  3. fit() the model with the XGBoost fit parameters.
  4. dump the fitted pipeline.

as follows:

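import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor

rng = np.random.RandomState(0)
X_train, X_val = rng.randn(50, 3), rng.randn(20, 3)
y_train, y_val = rng.randn(50, 1), rng.randn(20, 1)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    # XGBoost 1.6 and later take early_stopping_rounds in the constructor
    # (passing it to fit() is deprecated there); older versions take it
    # as a fit() argument instead.
    ('regressor', XGBRegressor(random_state=0, early_stopping_rounds=10)),
])

# Slicing returns a Pipeline that shares the same step objects, so fitting
# the slice also fits the scaler held by `pipeline`.
X_train_transformed = pipeline[:-1].fit_transform(X_train)
X_val_transformed = pipeline[:-1].transform(X_val)

# Fit only the final step, on the already-transformed data.
pipeline[-1].fit(
    X=X_train_transformed,
    y=y_train,
    eval_set=[(X_val_transformed, y_val)],
)

# `pipeline` now holds a fitted scaler and a fitted regressor.
joblib.dump(pipeline, 'pipeline.pkl')
pipe = joblib.load('pipeline.pkl')
pipe.score(X_val, y_val)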

Note: This works if you only want to fit the pipeline. However, if you want to run a grid search with early stopping, you will have to write your own grid search, like in this article.

Antoine Dubuis, answered Sep 30 '22 12:09