In Ridge Regression, we are solving Ax=b
with L2
Regularization. The direct calculation is given by:
x = (ATA + alpha * I)-1ATb
I have looked at the scikit-learn code and they do implement the same calculation. But, I can't seem to get the same results for alpha > 0
The minimal code to reproduce this.
import numpy as np
A = np.asmatrix(np.c_[np.ones((10,1)),np.random.rand(10,3)])
b = np.asmatrix(np.random.rand(10,1))
I = np.identity(A.shape[1])
alpha = 1
x = np.linalg.inv(A.T*A + alpha * I)*A.T*b
print(x.T)
>>> [[ 0.37371021 0.19558433 0.06065241 0.17030177]]
from sklearn.linear_model import Ridge
model = Ridge(alpha = alpha).fit(A[:,1:],b)
print(np.c_[model.intercept_, model.coef_])
>>> [[ 0.61241566 0.02727579 -0.06363385 0.05303027]]
Any suggestions on what I can do to resolve this discrepancy?
If Alpha is close to zero, the Ridge term itself is very small and thus the final error is based on RSS alone. If Alpha is too large, the impact of shrinkage grows and the coefficients B1, B2 ... Bn tends to zero. Choosing the right value helps the model learn the right features and better generalize the coefficients.
The last line of code prints the optimum values, which come out to be 0.4322 for alpha and 0.0065 for lambda. Once we have trained the model, we use it to generate the predictions and print the evaluation results for both the training and test datasets, using the lines of code below.
The L2 norm term in ridge regression is weighted by the regularization parameter alpha. So, if the alpha value is 0, it means that it is just an Ordinary Least Squares Regression model. So, the larger is the alpha, the higher is the smoothness constraint.
Here, α (alpha) is the parameter which balances the amount of emphasis given to minimizing RSS vs minimizing sum of square of coefficients. α can take various values: α = 0: The objective becomes same as simple linear regression. We'll get the same coefficients as simple linear regression.
This modification seems to yield the same result for the direct version and the numpy version:
import numpy as np
A = np.asmatrix(np.random.rand(10,3))
b = np.asmatrix(np.random.rand(10,1))
I = np.identity(A.shape[1])
alpha = 1
x = np.linalg.inv(A.T*A + alpha * I)*A.T*b
print (x.T)
from sklearn.linear_model import Ridge
model = Ridge(alpha = alpha, tol=0.1, fit_intercept=False).fit(A ,b)
print model.coef_
print model.intercept_
It seems the main reason for the difference is the class Ridge
has the parameter fit_intercept=True
(by inheritance from class _BaseRidge
) (source)
This is applying a data centering procedure before passing the matrices to the _solve_cholesky
function.
Here's the line in ridge.py that does it
X, y, X_mean, y_mean, X_std = self._center_data(
X, y, self.fit_intercept, self.normalize, self.copy_X,
sample_weight=sample_weight)
Also, it seems you were trying to implicitly account for the intercept by adding the column of 1's. As you see, this is not necessary if you specify fit_intercept=False
Appendix: Does the Ridge class actually implement the direct formula?
It depends on the choice of the solver
parameter.
Effectively, if you do not specify the solver
parameter in Ridge
, it takes by default solver='auto'
(which internally resorts to solver='cholesky'
). This should be equivalent to the direct computation.
Rigorously, _solve_cholesky
uses numpy.linalg.solve
instead of numpy.inv
. But it can be easily checked that
np.linalg.solve(A.T*A + alpha * I, A.T*b)
yields the same as
np.linalg.inv(A.T*A + alpha * I)*A.T*b
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With