
Normalization in scikit-learn linear models

If the normalize parameter is set to True in any of the linear models in sklearn.linear_model, is normalization also applied to the input data during the score step?

For example:

from sklearn import linear_model
from sklearn.datasets import load_boston

a = load_boston()

l = linear_model.ElasticNet(normalize=False)
l.fit(a["data"][:400], a["target"][:400])
print(l.score(a["data"][400:], a["target"][400:]))
# 0.24192774524694727

l = linear_model.ElasticNet(normalize=True)
l.fit(a["data"][:400], a["target"][:400])
print(l.score(a["data"][400:], a["target"][400:]))
# -2.6177006348389167

In this case we see a degradation in prediction power when we set normalize=True, and I can't tell whether this is simply an artifact of the score function not applying the normalization, or whether the normalized values genuinely caused the model's performance to drop.

asked Oct 20 '15 by mgoldwasser


People also ask

What does sklearn linear_model do?

linear_model is a module of sklearn that contains different classes for performing machine learning with linear models. The term linear model implies that the model is specified as a linear combination of features.
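As a quick illustration of what the module offers, a few of its estimator classes can be imported directly (this is a non-exhaustive sample):

```python
# A handful of the linear-model classes provided by sklearn.linear_model
from sklearn.linear_model import (
    LinearRegression,  # ordinary least squares
    Ridge,             # L2-regularized regression
    Lasso,             # L1-regularized regression
    ElasticNet,        # combined L1/L2 regularization
)

for cls in (LinearRegression, Ridge, Lasso, ElasticNet):
    print(cls.__name__)
```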

What does linear_model LinearRegression () do?

LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation. Its fit_intercept parameter controls whether to calculate the intercept for this model.
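A minimal sketch of fitting LinearRegression, using made-up noiseless data (y = 3*x + 5) so the recovered coefficient and intercept are exact:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data generated from y = 3*x + 5
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 3.0 * X[:, 0] + 5.0

reg = LinearRegression(fit_intercept=True).fit(X, y)
print(reg.coef_)       # approximately [3.]
print(reg.intercept_)  # approximately 5.0
```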

What is the l1 ratio?

The parameter l1_ratio corresponds to alpha in the glmnet R package while alpha corresponds to the lambda parameter in glmnet. Specifically, l1_ratio = 1 is the lasso penalty. Currently, l1_ratio <= 0.01 is not reliable, unless you supply your own sequence of alpha. Read more in the User Guide.
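This can be checked directly: with l1_ratio=1.0, ElasticNet should recover the same coefficients as Lasso at the same alpha. A sketch on synthetic data (the data and coefficient values here are arbitrary, chosen only for illustration):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic regression problem with a sparse true coefficient vector
rng = np.random.RandomState(0)
X = rng.randn(50, 3)
y = X @ np.array([1.0, 0.0, -2.0]) + 0.1 * rng.randn(50)

# l1_ratio=1.0 makes the ElasticNet penalty a pure L1 (lasso) penalty
enet = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(np.allclose(enet.coef_, lasso.coef_))  # the two fits should agree
```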

What is Reg Coef_?

The coef_ attribute gives the fitted coefficients of the features in your dataset, one per feature.
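For instance, with two features, coef_ has two entries. A small sketch with invented data whose true coefficients are [2, -1]:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated exactly from y = 2*x0 - 1*x1 (no noise, no intercept)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 3.0]])
y = 2.0 * X[:, 0] - 1.0 * X[:, 1]

model = LinearRegression().fit(X, y)
print(model.coef_)  # one coefficient per feature: approximately [ 2. -1.]
```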


1 Answer

The normalization is indeed applied to both fit data and predict data. The reason you see such different results is that the range of the columns in the Boston House Price dataset varies widely:

>>> from sklearn.datasets import load_boston
>>> boston = load_boston()
>>> boston.data.std(0)
array([  8.58828355e+00,   2.32993957e+01,   6.85357058e+00,
         2.53742935e-01,   1.15763115e-01,   7.01922514e-01,
         2.81210326e+01,   2.10362836e+00,   8.69865112e+00,
         1.68370495e+02,   2.16280519e+00,   9.12046075e+01,
         7.13400164e+00])

This means that the regularization terms in the ElasticNet have a very different effect on normalized vs unnormalized data, and this is why the results differ. You can confirm this by setting the regularization strength (alpha) to a very small number, e.g. 1E-8. In this case, regularization has very little effect and the normalization no longer affects prediction results.
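This check can be sketched as follows. Note that the normalize parameter has since been removed from scikit-learn's linear models, and newer versions no longer ship the Boston dataset, so this sketch uses synthetic data with deliberately mismatched column scales and standardizes the features by hand (roughly what normalize=True used to do):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Synthetic stand-in for the Boston data: three features on very different scales
rng = np.random.RandomState(0)
X = rng.randn(100, 3) * np.array([1.0, 10.0, 0.1])
y = X @ np.array([1.5, 0.2, 8.0]) + 0.1 * rng.randn(100)
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

# Manually standardize, mimicking the effect of the old normalize=True option
mu, sigma = X_train.mean(0), X_train.std(0)
Xs_train, Xs_test = (X_train - mu) / sigma, (X_test - mu) / sigma

alpha = 1e-8  # near-zero regularization: the fit is effectively plain least squares
raw_score = ElasticNet(alpha=alpha, max_iter=100000).fit(X_train, y_train).score(X_test, y_test)
std_score = ElasticNet(alpha=alpha, max_iter=100000).fit(Xs_train, y_train).score(Xs_test, y_test)

# With regularization switched off, the two scores should be nearly identical
print(raw_score, std_score)
```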

answered Oct 04 '22 by jakevdp