I am trying to optimize hyperparameters for ridge regression. But also add polynomial features. So, pipeline looks okay but getting error when try to gridsearchcv. Here:
# Importing the Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error
from collections import Counter
from IPython.core.display import display, HTML
sns.set_style('darkgrid')
# Data Preprocessing
from sklearn.datasets import load_boston
boston_dataset = load_boston()
dataset = pd.DataFrame(boston_dataset.data, columns = boston_dataset.feature_names)
dataset['MEDV'] = boston_dataset.target
# X and y Variables
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values.reshape(-1,1)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 25)
# Building the Model ------------------------------------------------------------------------
# Fitting regressior to the Training set
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
steps = [
('scalar', StandardScaler()),
('poly', PolynomialFeatures(degree=2)),
('model', Ridge())
]
ridge_pipe = Pipeline(steps)
ridge_pipe.fit(X_train, y_train)
# Predicting the Test set results
y_pred = ridge_pipe.predict(X_test)
# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = ridge_pipe, X = X_train, y = y_train, cv = 10)
accuracies.mean()
#accuracies.std()
# Applying Grid Search to find the best model and the best parameters
from sklearn.model_selection import GridSearchCV
parameters = [ {'alpha': np.arange(0, 0.2, 0.01) } ]
grid_search = GridSearchCV(estimator = ridge_pipe,
param_grid = parameters,
scoring = 'accuracy',
cv = 10,
n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train) # <-- GETTING ERROR IN HERE
Error:
ValueError: Invalid parameter ridge for estimator
What to do or, is there a better way to use ridge regression with pipeline? I would be pleased if put some sources about gridsearch because I am a newbie on this. The error:
We can do that with the GridSearchCV method, which I’ll come back to shortly. iii) Ridge () -> This is an estimator that performs the actual regression. The name of the method refers to Tikhonov regularization, more commonly known as ridge regression, that is performed to reduce the effect of multicollinearity.
Example for Ridge Regression Hyper parameters are: Ridge ( alpha =1.0,*, fit_intercept =True, normalize =False, copy_X =True, max_iter =None, tol =0.001, solver ='auto', random_state =None,) For more information, you can visit this documentation. Now the parameters are set, next step is define the search and execute the search.
It enhances regular linear regression by slightly changing its cost function, which results in less overfit models. In this article, you will learn everything you need to know about Ridge Regression, and how you can start using it in your own machine learning projects.
But even at this level of simplicity, it’s very evident how effective pipelines and grid-searches can be. Perhaps, what’s not so obvious is the value of critical thinking that goes into building a pipeline and putting it in a grid-search. An example of this is feature selection. Say we want to choose k features in a classification problem.
There are two problems in your code. First since you are using a pipeline
, you need to specify in the params
list which part of the pipeline does the params belongs to. See the official documentation for more information :
The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by a ‘__’, as in the example below
In this case, since alpha
is going to be used with ridge-regression
and you have used the string model
in the Pipeline defintion, you need to rename the key alpha
to model_alpha
:
steps = [
('scalar', StandardScaler()),
('poly', PolynomialFeatures(degree=2)),
('model', Ridge()) # <------ Whatever string you assign here will be used later
]
# Since you have named it as 'model', you need change it to 'model_alpha'
parameters = [ {'model__alpha': np.arange(0, 0.2, 0.01) } ]
Next, you need to understand this dataset is for Regression. You should not use accuracy
here, instead use a regression based scoring function like, mean_squared_error
. Here are some other metrics for regression that you can use. Something like this
from sklearn.metrics import mean_squared_error, make_scorer
scoring_func = make_scorer(mean_squared_error)
grid_search = GridSearchCV(estimator = ridge_pipe,
param_grid = parameters,
scoring = scoring_func, #<--- Use the scoring func defined above
cv = 10,
n_jobs = -1)
Here is a link to a Google colab notebook with working code.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With