I am setting up a predictive analytics pipeline on some data, and I am in the process of model selection. My target variable is skewed, so I would like to log-transform it in order to increase the performance of my linear regression estimators. I came across the relatively new TransformedTargetRegressor of scikit-learn, and I thought I could use it as part of a pipeline. I am attaching my code.
My initial attempt was to transform y_train before calling gs.fit(), by replacing it with np.log1p(y_train). This works, and I can perform the nested cross-validation and return the metrics of interest for all estimators. However, I would like to be able to get R^2 and RMSE for the trained model on previously unseen data (validation set), and I understand that in order to do that, I need to use (for example) the r2_score function on y_val, preds, where the predictions need to have been transformed back to the real values, i.e., preds = np.expm1(gs.predict(X_val)).
### Create a pipeline
pipe = Pipeline([
    # the transformer stage is populated by the param_grid
    ('transformer', TransformedTargetRegressor(func=np.log1p, inverse_func=np.expm1)),
    ('reg', DummyEstimator())  # Placeholder Estimator
])
### Candidate learning algorithms and their hyperparameters
alphas = [0.001, 0.01, 0.1, 1, 10, 100]
param_grid = [
    {'transformer__regressor': Lasso(),
     'reg': [Lasso()],  # Actual Estimator
     'reg__alpha': alphas},
    {'transformer__regressor': LassoLars(),
     'reg': [LassoLars()],  # Actual Estimator
     'reg__alpha': alphas},
    {'transformer__regressor': Ridge(),
     'reg': [Ridge()],  # Actual Estimator
     'reg__alpha': alphas},
    {'transformer__regressor': ElasticNet(),
     'reg': [ElasticNet()],  # Actual Estimator
     'reg__alpha': alphas,
     'reg__l1_ratio': [0.25, 0.5, 0.75]}]
### Scoring metrics (used by both the inner and the outer CV, so defined first)
scoring = ['neg_mean_absolute_error', 'r2', 'explained_variance', 'neg_mean_squared_error']
### Create grid search (Inner CV)
gs = GridSearchCV(pipe, param_grid=param_grid, cv=5, verbose=2, n_jobs=-1,
                  scoring=scoring, refit='r2', return_train_score=True)
### Fit
best_model = gs.fit(X_train, y_train)
### Outer CV
linear_cv_results = cross_validate(gs, X_train, y_train_transformed,
                                   scoring=scoring, cv=5, verbose=3, return_train_score=True)
### Calculate mean metrics
train_r2 = (linear_cv_results['train_r2']).mean()
test_r2 = (linear_cv_results['test_r2']).mean()
train_mae = (-linear_cv_results['train_neg_mean_absolute_error']).mean()
test_mae = (-linear_cv_results['test_neg_mean_absolute_error']).mean()
train_exp_var = (linear_cv_results['train_explained_variance']).mean()
test_exp_var = (linear_cv_results['test_explained_variance']).mean()
train_rmse = (np.sqrt(-linear_cv_results['train_neg_mean_squared_error'])).mean()
test_rmse = (np.sqrt(-linear_cv_results['test_neg_mean_squared_error'])).mean()
This code snippet does not work, because I cannot add TransformedTargetRegressor into my pipeline: it does not have a transform method (I get this TypeError: All intermediate steps should be transformers and implement fit and transform).
Is there a "proper" way of doing this, or do I just have to take the log transformation of y_val on the fly when I want to call the r2_score function etc.?
No, you cannot use TransformedTargetRegressor as an intermediate Pipeline step, because the original scikit-learn Pipeline does not change y, or the number of samples in X and y, between steps.
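The failure described in the question can be reproduced in a few lines. This is a minimal sketch on synthetic data (the arrays and the Ridge regressor are illustrative assumptions, not the asker's data); depending on the scikit-learn version, the step validation may run at construction or at fit time, so both are wrapped in the try block.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

# Synthetic, positive, right-skewed target (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.exp(X[:, 0]) + 1.0

error_message = None
try:
    # TransformedTargetRegressor has fit/predict but no transform,
    # so it cannot be an intermediate step of a Pipeline.
    bad_pipe = Pipeline([
        ("transformer", TransformedTargetRegressor(func=np.log1p, inverse_func=np.expm1)),
        ("reg", Ridge()),
    ])
    bad_pipe.fit(X, y)
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

Intermediate steps only ever see and return X, which is why a step that is meant to act on y has nowhere to plug in.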
Your use case is a little unclear. What is the need for the reg step if that same regressor is already passed to TransformedTargetRegressor?
Looking at the documentation of TransformedTargetRegressor, the regressor parameter accepts a regressor (which can also be a pipeline that applies some feature selection operations to X and has a regressor as its final stage). TransformedTargetRegressor then works as follows:
fit():
    regressor.fit(X, func(y))
predict():
    inverse_func(regressor.predict(X))
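The fit/predict behaviour outlined above can be checked against the manual approach from the question. The sketch below uses synthetic data and a fixed Ridge(alpha=1.0), both chosen purely for illustration, and compares the wrapped model with transforming y by hand and inverting the predictions.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

# Synthetic, positive, right-skewed target (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.exp(0.5 * X[:, 0] + 0.1 * rng.normal(size=200))

# Wrapped: the transform and its inverse are applied internally
ttr = TransformedTargetRegressor(regressor=Ridge(alpha=1.0),
                                 func=np.log1p, inverse_func=np.expm1)
ttr.fit(X, y)
preds_ttr = ttr.predict(X)

# Manual: fit on log1p(y), then map predictions back with expm1
manual = Ridge(alpha=1.0).fit(X, np.log1p(y))
preds_manual = np.expm1(manual.predict(X))

# Both routes should give the same predictions on the original scale
same = np.allclose(preds_ttr, preds_manual)
print(same)
```

Because predict() already returns values on the original scale, metrics such as r2_score can be computed against the untransformed y directly.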
So there is no need to append that same regressor as a new step. Your model selection code can now be:
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Lasso, LassoLars, Ridge, ElasticNet
# DummyEstimator is the placeholder estimator from the question; the grid search swaps it out
pipe = TransformedTargetRegressor(regressor=DummyEstimator(),
                                  func=np.log1p,
                                  inverse_func=np.expm1)
### Candidate learning algorithms and their hyperparameters
alphas = [0.001, 0.01, 0.1, 1, 10, 100]
# pipe is no longer a Pipeline, so the keys carry no 'transformer__' prefix,
# and each candidate regressor must be wrapped in a list
param_grid = [
    {'regressor': [Lasso()],
     'regressor__alpha': alphas},
    {'regressor': [LassoLars()],
     'regressor__alpha': alphas},
    {'regressor': [Ridge()],
     'regressor__alpha': alphas},
    {'regressor': [ElasticNet()],
     'regressor__alpha': alphas,
     'regressor__l1_ratio': [0.25, 0.5, 0.75]}
]
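To show how this fits together with the nested cross-validation and the validation-set metrics from the question, here is a self-contained sketch. The data is synthetic, Ridge() stands in for the question's DummyEstimator placeholder, and the grid is shortened to two regressors; all of these are assumptions made so the example runs on its own.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split

# Synthetic, positive, right-skewed target (illustrative only)
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
coef = np.array([0.8, -0.5, 0.3, 0.0, 0.0])
y = np.exp(X @ coef + 0.1 * rng.normal(size=300))
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Ridge() is only a placeholder here; the grid below replaces it
model = TransformedTargetRegressor(regressor=Ridge(),
                                   func=np.log1p, inverse_func=np.expm1)
alphas = [0.001, 0.01, 0.1, 1, 10]
param_grid = [
    {"regressor": [Lasso(max_iter=10000)], "regressor__alpha": alphas},
    {"regressor": [Ridge()], "regressor__alpha": alphas},
]
scoring = ["r2", "neg_mean_squared_error"]

# Inner CV: hyperparameter search; scores are computed on the original scale of y
gs = GridSearchCV(model, param_grid=param_grid, cv=5, scoring=scoring, refit="r2")

# Outer CV: pass the untransformed y_train, the wrapper handles the log transform
outer = cross_validate(gs, X_train, y_train, scoring=scoring, cv=5)
test_rmse = np.sqrt(-outer["test_neg_mean_squared_error"]).mean()

# Refit on the full training set and evaluate on unseen data, no manual expm1 needed
gs.fit(X_train, y_train)
preds = gs.predict(X_val)
val_r2 = r2_score(y_val, preds)
val_rmse = np.sqrt(mean_squared_error(y_val, preds))
print(gs.best_params_, val_r2, val_rmse)
```

Note that with this setup the cross-validated R^2 and RMSE are measured on the original target scale, so they are not directly comparable with scores obtained earlier by fitting on np.log1p(y_train).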