 

How to reproduce the behaviour of Ridge(normalize=True)?

This block of code:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

X = 'some_data'    # placeholder for the actual feature matrix (loaded below)
y = 'some_target'  # placeholder for the actual target

penalty = 1.5e-5
A = Ridge(normalize=True, alpha=penalty).fit(X, y)

triggers the following warning:

FutureWarning: 'normalize' was deprecated in version 1.0 and will be removed in 1.2.
If you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior:

from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(with_mean=False), Ridge())

If you wish to pass a sample_weight parameter, you need to pass it as a fit parameter to each step of the pipeline as follows:
kwargs = {s[0] + '__sample_weight': sample_weight for s in model.steps}
model.fit(X, y, **kwargs)

Set parameter alpha to: original_alpha * n_samples. 
  warnings.warn(
Ridge(alpha=1.5e-05)
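
Putting the warning's suggestion together with my snippet, the replacement would look like this (a sketch; n_samples taken as X.shape[0], no sample_weight in my case):

model = make_pipeline(StandardScaler(with_mean=False),
                      Ridge(alpha=penalty * X.shape[0]))  # alpha = original_alpha * n_samples
model.fit(X, y)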

But that approach gives me completely different coefficients, which is expected, since normalisation and standardisation are different transformations.

B = make_pipeline(StandardScaler(with_mean=False), Ridge(alpha=penalty))
B[1].fit(B[0].fit_transform(X), y)

Output:

A.coef_[0], B[1].coef_[0]
(124.87330648168594, 125511.75051106009)

The result still does not match if I set alpha = penalty * n_features.

Output:

A.coef_[0], B[1].coef_[0]
(124.87330648168594, 114686.09835548172)

Admittedly, Ridge(normalize=True) uses a slightly different normalization than I expected; the old docs say:

the regressor X will be normalized by subtracting mean and dividing by l2-norm

So what's the proper way to use ridge regression with normalization? The l2-norm apparently has to be obtained from the data first, then used to rescale it before fitting again, and nothing in sklearn's ridge regression seems to cover that out of the box, especially from version 1.2 onwards.
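
To spell out the difference (a sketch with random stand-in data; StandardScaler uses the population std, ddof=0): the per-column l2-norm of the centered data equals std * sqrt(n_samples), so StandardScaler alone cannot reproduce the old normalization.

import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))     # stand-in data for illustration

Xc = X_demo - X_demo.mean(axis=0)      # subtract mean
l2 = np.linalg.norm(Xc, axis=0)        # old normalize=True divided by this
std = X_demo.std(axis=0)               # StandardScaler divides by this (ddof=0)

print(np.allclose(l2, std * np.sqrt(X_demo.shape[0])))  # True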


Prepare the data for experimenting:

import pandas as pd

url = 'https://drive.google.com/file/d/1bu64NqQkG0YR8G2CQPkxR1EQUAJ8kCZ6/view?usp=sharing'
url = 'https://drive.google.com/uc?id=' + url.split('/')[-2]
data = pd.read_csv(url, index_col=0)

X = data.iloc[:,:15]
y = data['target']
asked Dec 20 '25 19:12 by Rossin

1 Answer

The difference is that the coefficients reported with normalize=True are to be applied directly to the unscaled inputs, whereas the pipeline approach applies its coefficients to the model's inputs, which are the scaled features.

You can "normalize" (an unfortunate overloading of the word) the coefficients by multiplying/dividing by the features' standard deviation. Together with the change to penalty suggested in the future warning, I get the same outputs:

import numpy as np

np.allclose(A.coef_, B[1].coef_ / B[0].scale_)
# True

(I've tested using sklearn.datasets.load_diabetes.)
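
A complete sketch of the comparison (assuming a scikit-learn version >= 1.0 and < 1.2, where normalize=True still exists alongside the warning):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
penalty = 1.5e-5

# Old behaviour: normalization handled inside Ridge.
A = Ridge(normalize=True, alpha=penalty).fit(X, y)

# Replacement: scale in a pipeline and multiply alpha by n_samples,
# as the FutureWarning suggests.
B = make_pipeline(StandardScaler(with_mean=False),
                  Ridge(alpha=penalty * X.shape[0]))
B.fit(X, y)

# B's coefficients act on the scaled features; dividing by the scaler's
# scale_ expresses them in the original units, matching A's.
print(np.allclose(A.coef_, B[1].coef_ / B[0].scale_))  # True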

answered Dec 24 '25 05:12 by Ben Reiniger


