
R lm versus Python sklearn linear_model

When I study Python SKlearn, the first example that I come across is Generalized Linear Models.

Code of its very first example:

from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
reg.coef_
# array([ 0.5,  0.5])

Here I assume [[0, 0], [1, 1], [2, 2]] represents a data frame with x1 = c(0,1,2) and x2 = c(0,1,2), and that [0, 1, 2] is the response y = c(0,1,2).

Immediately, I assume that array([ 0.5, 0.5]) holds the coefficients for x1 and x2.

But are there standard errors for those estimates? What about t-tests, p-values, R², and the other usual statistics?

Then I try to do the same thing in R.

X = data.frame(x1 = c(0,1,2), x2 = c(0,1,2), y = c(0,1,2))
lm(data = X, y ~ x1 + x2)
# Call:
# lm(formula = y ~ x1 + x2, data = X)
#
# Coefficients:
# (Intercept)           x1           x2
#   1.282e-16    1.000e+00           NA

Obviously x1 and x2 are perfectly collinear, so OLS fails here (R reports NA for x2). Why does sklearn still work and return these coefficients? Am I misunderstanding sklearn? Thanks.
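The collinearity can be confirmed numerically; a small illustrative check with NumPy (not part of the original post):

```python
import numpy as np

# The two columns of the design matrix are identical,
# so its rank is 1 rather than 2.
X = np.array([[0, 0], [1, 1], [2, 2]])
print(np.linalg.matrix_rank(X))  # 1
```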

John asked Feb 05 '23

1 Answer

Both solutions are correct (assuming that NA behaves like a zero). Which solution is favored depends on the numerical solver used by the OLS estimator.
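A quick NumPy sketch (not part of the original answer) showing that both coefficient vectors fit the training data exactly: since x1 == x2, any split of the total weight 1.0 between the two columns gives identical predictions.

```python
import numpy as np

X = np.array([[0., 0.], [1., 1.], [2., 2.]])
y = np.array([0., 1., 2.])

w_sklearn = np.array([0.5, 0.5])  # sklearn's minimum-norm solution
w_r = np.array([1.0, 0.0])        # R's solution, with NA treated as 0

print(X @ w_sklearn)  # [0. 1. 2.]
print(X @ w_r)        # [0. 1. 2.]
```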

sklearn.linear_model.LinearRegression is based on scipy.linalg.lstsq which in turn calls the LAPACK gelsd routine which is described here:

http://www.netlib.org/lapack/lug/node27.html

In particular, it says that when the problem is rank deficient, it seeks the minimum-norm least-squares solution.
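To see the minimum-norm behavior directly, you can call scipy.linalg.lstsq on the same data (a sketch, assuming SciPy is installed; sklearn additionally fits an intercept, which is ~0 here, so the coefficients match):

```python
import numpy as np
from scipy.linalg import lstsq

X = np.array([[0., 0.], [1., 1.], [2., 2.]])
y = np.array([0., 1., 2.])

# gelsd returns the least-squares solution of minimum L2 norm.
coef, _, rank, _ = lstsq(X, y)
print(coef)  # [0.5 0.5]
print(rank)  # 1 -- rank deficient

# [0.5, 0.5] has a smaller norm than the equally valid [1, 0]:
print(np.linalg.norm([0.5, 0.5]) < np.linalg.norm([1.0, 0.0]))  # True
```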

If you want to favor the other solution, you can use a coordinate-descent solver with a tiny L1 penalty, as implemented in the Lasso class:

>>> from sklearn.linear_model import Lasso
>>> reg = Lasso(alpha=1e-8)
>>> reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])

Lasso(alpha=1e-08, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)
>>> reg.coef_
array([  9.99999985e-01,   3.97204719e-17])
ogrisel answered Feb 08 '23