
Why are LASSO in sklearn (python) and matlab statistical package different?

I am using LassoCV from sklearn to select the best model by cross-validation. I found that cross-validation gives different results depending on whether I use sklearn or the MATLAB Statistics Toolbox.

I used MATLAB to replicate the example given in http://www.mathworks.se/help/stats/lasso-and-elastic-net.html and got a figure like this:

[Figure: MATLAB lassoPlot output]

Then I saved the MATLAB data and tried to replicate the figure with lasso_path from sklearn. I got:

[Figure: sklearn lasso_path output]

Although there are some similarities between these two figures, there are also clear differences. As far as I understand, the parameter lambda in MATLAB and alpha in sklearn are the same, but the figures suggest otherwise. Can somebody point out which one is correct, or what I am missing? The coefficients obtained are also different, which is my main concern.

Matlab Code:

rng(3,'twister') % for reproducibility
X = zeros(200,5);
for ii = 1:5
      X(:,ii) = exprnd(ii,200,1);
end
r = [0;2;0;-3;0];
Y = X*r + randn(200,1)*.1;

save randomData.mat % To be used in python code

[b fitinfo] = lasso(X,Y,'cv',10);
lassoPlot(b,fitinfo,'plottype','lambda','xscale','log');

disp('Lambda with min MSE')
fitinfo.LambdaMinMSE
disp('Lambda with 1SE')
fitinfo.Lambda1SE
disp('Quality of Fit')
lambdaindex = fitinfo.Index1SE;
fitinfo.MSE(lambdaindex)
disp('Number of non zero predictors')
fitinfo.DF(lambdaindex)
disp('Coefficient of fit at that lambda')
b(:,lambdaindex)

Python Code:

import scipy.io
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path, LassoCV

# Load the workspace saved by the MATLAB script
data = scipy.io.loadmat('randomData.mat')
X = data['X']
Y = data['Y'].flatten()

# Choose alpha by 10-fold cross-validation
model = LassoCV(cv=10, max_iter=1000).fit(X, Y)
print('alpha', model.alpha_)
print('coef', model.coef_)

eps = 1e-2  # the smaller it is, the longer the path
# Current scikit-learn: lasso_path fits no intercept, so center the data first.
# It returns (alphas, coefs, dual_gaps), with coefs of shape (n_features, n_alphas).
alphas_lasso, coefs_lasso, _ = lasso_path(X - X.mean(axis=0), Y - Y.mean(), eps=eps)

plt.figure(1)
ax = plt.gca()
ax.set_prop_cycle(color=['b', 'r', 'g', 'c', 'k'])
plt.semilogx(alphas_lasso, coefs_lasso.T)  # one trace per predictor
ax.invert_xaxis()
plt.xlabel('alpha')
plt.show()
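
One difference is visible in the two listings themselves: the MATLAB code reports the fit at fitinfo.Index1SE, while LassoCV's alpha_ is the alpha with the smallest mean cross-validation error. The following is a minimal sketch of a one-standard-error choice built from LassoCV's alphas_ and mse_path_ attributes. The data are a synthetic stand-in for randomData.mat, and the standard-error formula (sample standard deviation across folds divided by the square root of the number of folds) is an assumption, so MATLAB's Lambda1SE may be computed slightly differently.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic stand-in for randomData.mat (assumption: not the MATLAB draws)
rng = np.random.default_rng(3)
X = np.column_stack([rng.exponential(scale=i, size=200) for i in range(1, 6)])
r = np.array([0.0, 2.0, 0.0, -3.0, 0.0])
Y = X @ r + 0.1 * rng.standard_normal(200)

model = LassoCV(cv=10, max_iter=10000).fit(X, Y)

# mse_path_ has shape (n_alphas, n_folds)
n_folds = model.mse_path_.shape[1]
mean_mse = model.mse_path_.mean(axis=1)
se_mse = model.mse_path_.std(axis=1, ddof=1) / np.sqrt(n_folds)

# Largest alpha whose mean CV error is within one SE of the minimum
i_min = int(np.argmin(mean_mse))
threshold = mean_mse[i_min] + se_mse[i_min]
alpha_1se = model.alphas_[mean_mse <= threshold].max()

print('alpha (min MSE):', model.alpha_)
print('alpha (1SE):', alpha_1se)
```

The one-standard-error alpha is never smaller than the minimum-MSE alpha, so it gives an equally or more heavily penalized model. Comparing MATLAB's LambdaMinMSE with alpha_, and Lambda1SE with alpha_1se, keeps the comparison like for like.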
asked Oct 05 '12 12:10 by imsc



1 Answer

I do not have MATLAB, but be aware that the value obtained by cross-validation can be unstable, because it depends on how the samples are split into folds.

Even if you run the cross-validation twice in Python, you can get two different results. Consider this example:

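from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

# x, y: predictor matrix and response (e.g. X and Y from the question).
# The original call also passed normalize=True, which scikit-learn 1.2 removed.
# Each KFold below shuffles with no fixed random_state, so the folds differ.
kf = KFold(n_splits=10, shuffle=True)
cv = LassoCV(cv=kf).fit(x, y)
print(cv.alpha_)
kf = KFold(n_splits=10, shuffle=True)
cv = LassoCV(cv=kf).fit(x, y)
print(cv.alpha_)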

0.00645093258722
0.00691712356467
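
To check whether the fold split alone explains a difference, the shuffle can be pinned with random_state. The following is a minimal sketch that assumes synthetic data (not the data from the question) and a hypothetical helper cv_alpha. The same seed reproduces the same alpha, while a different seed may or may not pick a different point on the alpha grid.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

# Synthetic data for illustration only (assumption)
rng = np.random.default_rng(0)
x = rng.standard_normal((200, 5))
y = x @ np.array([0.0, 2.0, 0.0, -3.0, 0.0]) + 0.1 * rng.standard_normal(200)

def cv_alpha(seed):
    """Alpha selected by LassoCV with 10 shuffled folds fixed by `seed`."""
    kf = KFold(n_splits=10, shuffle=True, random_state=seed)
    return LassoCV(cv=kf).fit(x, y).alpha_

alpha_a = cv_alpha(0)  # seed 0
alpha_b = cv_alpha(0)  # seed 0 again: identical folds
alpha_c = cv_alpha(1)  # different folds
print(alpha_a, alpha_b, alpha_c)
```

Fixing the seed makes the Python side repeatable, but it does not make the folds match the ones MATLAB draws, so some difference in the selected penalty can remain.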
answered Sep 21 '22 08:09 by Donbeo