Ridge Regression: Scikit-learn vs. direct calculation does not match for alpha > 0

In Ridge Regression, we are solving Ax=b with L2 Regularization. The direct calculation is given by:

x = (A^T A + alpha * I)^(-1) A^T b

I have looked at the scikit-learn code and it does implement the same calculation. But I can't seem to get the same results for alpha > 0.

Here is the minimal code to reproduce this:

import numpy as np
A = np.asmatrix(np.c_[np.ones((10,1)),np.random.rand(10,3)])
b = np.asmatrix(np.random.rand(10,1))
I = np.identity(A.shape[1])
alpha = 1
x = np.linalg.inv(A.T*A + alpha * I)*A.T*b
print(x.T)
>>> [[ 0.37371021  0.19558433  0.06065241  0.17030177]]

from sklearn.linear_model import Ridge
model = Ridge(alpha = alpha).fit(A[:,1:],b)
print(np.c_[model.intercept_, model.coef_])
>>> [[ 0.61241566  0.02727579 -0.06363385  0.05303027]]

Any suggestions on what I can do to resolve this discrepancy?

asked Jul 25 '16 by amitkaps


1 Answer

This modification seems to yield the same result for the direct calculation and the scikit-learn version:

import numpy as np
A = np.asmatrix(np.random.rand(10,3))
b = np.asmatrix(np.random.rand(10,1))
I = np.identity(A.shape[1])
alpha = 1
x = np.linalg.inv(A.T*A + alpha * I)*A.T*b
print(x.T)


from sklearn.linear_model import Ridge
# fit_intercept=False makes Ridge solve the same system as the direct formula
model = Ridge(alpha=alpha, tol=0.1, fit_intercept=False).fit(A, b)

print(model.coef_)
print(model.intercept_)

It seems the main reason for the difference is that the Ridge class has fit_intercept=True by default (inherited from the _BaseRidge class).

This applies a data-centering step before the matrices are passed to the _solve_cholesky function.

Here's the line in ridge.py that does it

    X, y, X_mean, y_mean, X_std = self._center_data(
        X, y, self.fit_intercept, self.normalize, self.copy_X,
        sample_weight=sample_weight)

Also, it seems you were trying to account for the intercept implicitly by adding a column of 1s. As you can see, this is not necessary if you specify fit_intercept=False.
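
To illustrate the centering (this is a rough sketch under my own variable names, not scikit-learn's actual code): if you center the data yourself, solve the regularized system on the centered data, and recover the intercept from the means, you should get the same numbers as Ridge with fit_intercept=True:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.rand(10, 3)
y = rng.rand(10)
alpha = 1.0

# center the data, as Ridge does internally when fit_intercept=True
X_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - X_mean, y - y_mean

# direct ridge solution on the centered data
I = np.identity(X.shape[1])
coef = np.linalg.solve(Xc.T @ Xc + alpha * I, Xc.T @ yc)
intercept = y_mean - X_mean @ coef

model = Ridge(alpha=alpha).fit(X, y)
print(coef, intercept)                # direct calculation
print(model.coef_, model.intercept_)  # should match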

Appendix: Does the Ridge class actually implement the direct formula?

It depends on the choice of the solver parameter.

Effectively, if you do not specify the solver parameter in Ridge, it defaults to solver='auto' (which internally resorts to solver='cholesky'). This should be equivalent to the direct computation.

Strictly speaking, _solve_cholesky uses numpy.linalg.solve instead of numpy.linalg.inv. But it can easily be checked that

np.linalg.solve(A.T*A + alpha * I, A.T*b)

yields the same as

np.linalg.inv(A.T*A + alpha * I)*A.T*b
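
For example, a quick check with arbitrary random data (using plain arrays and the @ operator rather than np.matrix) shows the two expressions agree:

import numpy as np

rng = np.random.RandomState(0)
A = rng.rand(10, 3)
b = rng.rand(10, 1)
alpha = 1.0
I = np.identity(A.shape[1])

via_solve = np.linalg.solve(A.T @ A + alpha * I, A.T @ b)
via_inv = np.linalg.inv(A.T @ A + alpha * I) @ A.T @ b
print(np.allclose(via_solve, via_inv))  # True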
answered Oct 20 '22 by JARS