
Ridge Regression Grid Search with Pipeline

I am trying to optimize the hyperparameters for ridge regression, while also adding polynomial features. The pipeline itself seems fine, but I get an error when I try to run GridSearchCV. Here is the code:

# Importing the Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error
from collections import Counter
from IPython.core.display import display, HTML
sns.set_style('darkgrid')

# Data Preprocessing 
from sklearn.datasets import load_boston
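# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this snippet only runs on older scikit-learn versions.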
boston_dataset = load_boston()
dataset = pd.DataFrame(boston_dataset.data, columns = boston_dataset.feature_names)
dataset['MEDV'] = boston_dataset.target

# X and y Variables
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values.reshape(-1,1)

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 25)

# Building the Model ------------------------------------------------------------------------

# Fitting the regressor to the Training set
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

steps = [
    ('scalar', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('model', Ridge())
]

ridge_pipe = Pipeline(steps)
ridge_pipe.fit(X_train, y_train)
# Predicting the Test set results
y_pred = ridge_pipe.predict(X_test)

# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = ridge_pipe, X = X_train, y = y_train, cv = 10)
accuracies.mean()
#accuracies.std()

# Applying Grid Search to find the best model and the best parameters
from sklearn.model_selection import GridSearchCV

parameters = [ {'alpha': np.arange(0, 0.2, 0.01) } ]

grid_search = GridSearchCV(estimator = ridge_pipe, 
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train)  # <-- GETTING ERROR IN HERE

Error:

ValueError: Invalid parameter ridge for estimator

What should I do, or is there a better way to use ridge regression with a pipeline? I would also appreciate some resources on grid search, since I am new to this.

asked Aug 06 '19 by cepel



1 Answer

There are two problems in your code. First, since you are using a pipeline, you need to specify in the parameter grid which step of the pipeline each parameter belongs to. See the official documentation for more information:

The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by a ‘__’, as in the example below

In this case, since alpha belongs to the ridge regression step and you used the string model for that step in the Pipeline definition, you need to rename the key alpha to model__alpha:

steps = [
    ('scalar', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('model', Ridge())  # <------ Whatever string you assign here will be used later
]

# Since you have named the step 'model', the key has to be 'model__alpha'
parameters = [ {'model__alpha': np.arange(0, 0.2, 0.01) } ]
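If you are not sure which keys are valid, the pipeline can list them for you. A minimal sketch, assuming the ridge_pipe defined above:

# Every nested parameter is exposed as '<step name>__<parameter name>'
print(sorted(ridge_pipe.get_params().keys()))
# includes entries such as 'model__alpha', 'poly__degree' and 'scalar__with_mean'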

Next, you need to understand that this is a regression dataset. You should not use accuracy here; instead, use a regression scoring function such as mean_squared_error (the scikit-learn documentation lists other regression metrics you can use). Something like this:

from sklearn.metrics import mean_squared_error, make_scorer
# greater_is_better=False tells GridSearchCV that a lower MSE is better
scoring_func = make_scorer(mean_squared_error, greater_is_better=False)

grid_search = GridSearchCV(estimator = ridge_pipe, 
                           param_grid = parameters,
                           scoring = scoring_func,  #<--- Use the scoring func defined above
                           cv = 10,
                           n_jobs = -1)
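For completeness, a minimal sketch of running the search and inspecting the result, assuming the X_train and y_train from the question; passing the built-in string scorer 'neg_mean_squared_error' is an equivalent shortcut that already handles the sign for you:

grid_search = grid_search.fit(X_train, y_train)
print(grid_search.best_params_)  # the best value found for 'model__alpha'
print(grid_search.best_score_)   # the corresponding cross-validated score

# Equivalent shortcut using scikit-learn's built-in scorer name
grid_search = GridSearchCV(ridge_pipe,
                           param_grid = parameters,
                           scoring = 'neg_mean_squared_error',
                           cv = 10,
                           n_jobs = -1).fit(X_train, y_train)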

Here is a link to a Google colab notebook with working code.

answered Sep 23 '22 by Gambit1614