I want to use GridSearchCV with an XGBoost model inside a pipeline. I have a preprocessor for the data, defined earlier in the code, and some grid params:
XGBmodel = XGBRegressor(random_state=0)
pipe = Pipeline(steps=[
('preprocess', preprocessor),
('XGBmodel', XGBmodel)
])
I want to pass these fit params:
fit_params = {"XGBmodel__eval_set": [(X_valid, y_valid)],
"XGBmodel__early_stopping_rounds": 10,
"XGBmodel__verbose": False}
I am trying to fit the model:
searchCV = GridSearchCV(pipe, cv=5, param_grid=param_grid, fit_params=fit_params)
searchCV.fit(X_train, y_train)
but I get an error on the line with eval_set:
DataFrame.dtypes for data must be int, float or bool
I guess this is because the validation data does not go through the preprocessing, but everything I find when I google does it this way, so it seems it should work. I also tried to find a way to apply the preprocessor to the validation data separately, but the validation data cannot be transformed until the preprocessor has been fitted on the training data.
Full code
columns = num_cols + cat_cols
X_train = X_full_train[columns].copy()
X_valid = X_full_valid[columns].copy()
num_preprocessor = SimpleImputer(strategy = 'mean')
cat_preprocessor = Pipeline(steps=[
('imputer', SimpleImputer(strategy = 'most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(transformers=[
('num', num_preprocessor, num_cols),
('cat', cat_preprocessor, cat_cols)
])
XGBmodel = XGBRegressor(random_state=0)
pipe = Pipeline(steps=[
('preprocess', preprocessor),
('XGBmodel', XGBmodel)
])
param_grid = {
"XGBmodel__n_estimators": [10, 50, 100, 500],
"XGBmodel__learning_rate": [0.1, 0.5, 1],
}
fit_params = {"XGBmodel__eval_set": [(X_valid, y_valid)],
"XGBmodel__early_stopping_rounds": 10,
"XGBmodel__verbose": False}
searchCV = GridSearchCV(pipe, cv=5, param_grid=param_grid, fit_params=fit_params)
searchCV.fit(X_train, y_train)
Is there any way to preprocess the validation data in the pipeline? Or maybe a completely different way to implement this?
There is no good way. If you have a long pipeline of transformers before the model, consider fitting the transformers in the pipeline and then fitting the model separately.
The underlying issue is that a pipeline has no notion of a validation set used during model fitting: fit params such as eval_set are forwarded to the final step unchanged, so the validation data never passes through the transformers. You can see a discussion on the LightGBM GitHub here. Their proposal is to pre-fit the transformers and apply them to the validation data before you fit the full pipeline. This can be fine if your transformers are fast, but it can double the CPU time in an extreme scenario, because the transformers are fitted twice.
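To make the forwarding behaviour concrete, here is a minimal sketch. `RecordingRegressor` is a hypothetical stand-in for XGBRegressor (not part of any library) that only records what its fit() receives, and the data is synthetic. It first shows the raw validation data reaching the model unscaled, then applies the pre-fit workaround described above.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class RecordingRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical stand-in for XGBRegressor: records what fit() receives."""

    def fit(self, X, y, eval_set=None):
        self.train_mean_ = float(np.mean(X))
        self.eval_mean_ = float(np.mean(eval_set[0][0]))
        self.y_mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.y_mean_)


# Synthetic features centred far from zero, so scaling is easy to detect.
rng = np.random.RandomState(0)
X_train = rng.normal(loc=100.0, scale=5.0, size=(200, 3))
X_valid = rng.normal(loc=100.0, scale=5.0, size=(50, 3))
y_train, y_valid = rng.randn(200), rng.randn(50)

pipe = Pipeline([("scaler", StandardScaler()), ("model", RecordingRegressor())])

# 1) The pipeline scales X_train but forwards eval_set to the model untouched.
pipe.fit(X_train, y_train, model__eval_set=[(X_valid, y_valid)])
train_mean = pipe.named_steps["model"].train_mean_    # scaled training data
raw_eval_mean = pipe.named_steps["model"].eval_mean_  # still on the raw scale

# 2) Workaround: fit a copy of the transformers first, transform the
#    validation data by hand, then fit the full pipeline.
prefit = clone(pipe[:-1]).fit(X_train)
X_valid_t = prefit.transform(X_valid)
pipe.fit(X_train, y_train, model__eval_set=[(X_valid_t, y_valid)])
fixed_eval_mean = pipe.named_steps["model"].eval_mean_  # now on the scaled scale

print(train_mean, raw_eval_mean, fixed_eval_mean)
```

The transformers are fitted twice here (once in the copy, once inside pipe.fit), which is the extra cost mentioned above.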
One way to train a pipeline that uses early stopping is to fit the preprocessing steps and the regressor separately.
The steps are the following:
1. fit_transform() the transformers on the training data.
2. transform() the validation data.
3. fit() the model with the XGBoost parameters.
As follows:
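from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from xgboost import XGBRegressor
import numpy as np
import joblib
# Synthetic data for illustration
rng = np.random.RandomState(0)
X_train, X_val = rng.randn(50, 3), rng.randn(20, 3)
y_train, y_val = rng.randn(50, 1), rng.randn(20, 1)
pipeline = Pipeline([
('scaler', StandardScaler()),
('regressor', XGBRegressor(random_state=0)),
])
# Slicing a Pipeline shares the underlying estimators, so fitting
# pipeline[:-1] fits the scaler that lives inside `pipeline` itself.
X_train_transformed = pipeline[:-1].fit_transform(X_train)
X_val_transformed = pipeline[:-1].transform(X_val)
# Fit only the final regressor, monitoring the transformed validation data.
# Note: recent XGBoost releases take early_stopping_rounds in the
# XGBRegressor constructor; passing it to fit() is deprecated or removed,
# depending on the version.
pipeline[-1].fit(
X=X_train_transformed,
y=y_train,
eval_set=[(X_val_transformed, y_val)],
early_stopping_rounds=10,
)
# Every step is now fitted, so the pipeline can be saved and reloaded as a whole.
joblib.dump(pipeline, 'pipeline.pkl')
pipe = joblib.load('pipeline.pkl')
pipe.score(X_val, y_val)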
Note: This works if you only want to fit the pipeline. However, if you want to perform a grid search with early stopping, you will have to write your own grid search, like in this article.