
Evaluate transformations with the same model in scikit-learn

I would like to perform a regression analysis and test different transformations of the input variables for the same model. To accomplish this, I created a dictionary with the different pipelines, which I loop through:

import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PowerTransformer
from sklearn.compose import TransformedTargetRegressor

# Define transformations and models
models = {
    'linear': LinearRegression(),
    'power': make_pipeline(PowerTransformer(), LinearRegression()),
    'log': make_pipeline(FunctionTransformer(np.log, np.exp),
                         LinearRegression()),
    'log-sqrt': TransformedTargetRegressor(
        regressor=make_pipeline(
            FunctionTransformer(np.log, np.exp),
            LinearRegression()),
        func=np.sqrt,
        inverse_func=np.square
        )
    }

# x_train, y_train and x_hat (new inputs to predict on) are assumed to be defined
parameters = pd.DataFrame()
for name, model in models.items():
    model.fit(x_train, y_train)
    y_hat = model.predict(x_hat)
    y_hat_train = model.predict(x_train)
    r2 = model.score(x_train, y_train)
    parameters.at[name, 'MSE'] = mean_squared_error(y_train, y_hat_train)
    parameters.at[name, 'R2'] = r2
best_model = parameters['R2'].idxmax()

This works. However, there is probably a more elegant solution similar to GridSearchCV for evaluating models. Can anyone give me some advice on what I should be looking for?

asked by p1xel


2 Answers

The first thing that comes to my mind is this:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("preproc", None),
    ("model", LinearRegression()),
])

params = {
    "preproc": [
        None,
        PowerTransformer(),
        FunctionTransformer(np.log, np.exp),
    ],
    "model": [
        LinearRegression(),
        TransformedTargetRegressor(
            regressor=LinearRegression(),
            func=np.sqrt,
            inverse_func=np.square,
        ),
    ],
}

search = GridSearchCV(
    pipeline,
    params,
    ...
)

This doesn't net you cool names for the resultant models, and it produces 6 models instead of your 4 (the two extra combinations are the power transform with the sqrt target, and no preprocessing with the sqrt target). You also won't retain the actual trained models (aside from an optional-but-on-by-default final "best" estimator), if that's something you need. You'll get some automatic cross-validation, which is probably good. And you'll get parallelization for free.

This leans heavily on the fact that an entire pipeline step can be set as a "hyperparameter" in the sklearn search API, and that a step set to None (or "passthrough") is treated as a no-op.

I believe Pipeline(preproc, TransformedTarget(model)) [that my approach produces] is operationally the same as TransformedTarget(Pipeline(preproc, model)) [that you've coded].
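
For completeness, a minimal sketch of how the search above could be run and inspected, assuming the x_train / y_train data from your question (the scoring and cv values are just illustrative choices):

search = GridSearchCV(pipeline, params, scoring="r2", cv=5, n_jobs=-1)
search.fit(x_train, y_train)

# one row per preproc/model combination, with the cross-validated scores
results = pd.DataFrame(search.cv_results_)
results = results[["params", "mean_test_score", "rank_test_score"]]

# the winning pipeline, refit on the full training set (refit=True is the default)
best_model = search.best_estimator_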

answered by Ben Reiniger


If the aim is to obtain the best performing pipeline in a hyperparameter search framework, then you could use Optuna as follows:

import optuna
from sklearn.model_selection import cross_val_score

def objective(trial):

    # the preprocessing method is the hyperparameter to optimize:
    preproc = trial.suggest_categorical(
        "preproc", [None, "power", "log", "log-sqrt"]
    )

    if preproc is None:
        model = LinearRegression()

    elif preproc == "power":
        model = make_pipeline(PowerTransformer(), LinearRegression())

    elif preproc == "log":
        model = make_pipeline(
            FunctionTransformer(np.log, np.exp), LinearRegression())

    else:  # "log-sqrt"
        model = TransformedTargetRegressor(
            regressor=make_pipeline(
                FunctionTransformer(np.log, np.exp),
                LinearRegression()),
            func=np.sqrt,
            inverse_func=np.square
        )

    # cross-validated R2 (a regression metric, matching the question's setting)
    score = cross_val_score(model, X_train, y_train, scoring='r2', cv=3)

    return score.mean()


# set up the study; GridSampler needs the search space up front
search_space = {"preproc": [None, "power", "log", "log-sqrt"]}
study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.GridSampler(search_space)
)

study.optimize(objective, n_trials=4)  # one trial per candidate in the grid

I haven't tested the code, so take it as a guideline. Adapted from https://www.kaggle.com/code/solegalli/nested-hyperparameter-spaces-with-optuna

When you create a study in Optuna, the sampler parameter lets you choose between grid search, random search and other strategies. For this one, as it is just 4 alternatives, grid search should be enough.
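
As a sketch of what switching samplers looks like (the seed is only there to make runs reproducible; with a random sampler the search space no longer needs to be declared up front):

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.RandomSampler(seed=0),
)
study.optimize(objective, n_trials=15)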

You can obtain the best configuration (the winning value of "preproc") like this:

study.best_params

And the results for each trial like this:

study.trials_dataframe()
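
Note that study.best_params only tells you which preprocessing won; to get a fitted estimator you rebuild it from that value. A sketch, where build_model is a hypothetical helper mirroring the if/elif chain inside objective:

def build_model(preproc):
    if preproc is None:
        return LinearRegression()
    if preproc == "power":
        return make_pipeline(PowerTransformer(), LinearRegression())
    if preproc == "log":
        return make_pipeline(FunctionTransformer(np.log, np.exp), LinearRegression())
    # "log-sqrt"
    return TransformedTargetRegressor(
        regressor=make_pipeline(FunctionTransformer(np.log, np.exp), LinearRegression()),
        func=np.sqrt,
        inverse_func=np.square,
    )

best_pipeline = build_model(study.best_params["preproc"])
best_pipeline.fit(X_train, y_train)
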
answered by Sole Galli


