I'm pretty sure it's a feature, not a bug, but I would like to know if there is a way to make sklearn and statsmodels match in their logit estimates. A very simple example:
import numpy as np
import statsmodels.formula.api as sm
from sklearn.linear_model import LogisticRegression
np.random.seed(123)
n = 100
y = np.random.random_integers(0, 1, n)  # deprecated in newer NumPy; np.random.randint(0, 2, n) is the modern equivalent
x = np.random.random((n, 2))
# Constant term
x[:, 0] = 1.
The estimates with statsmodels:
sm_lgt = sm.Logit(y, x).fit()
Optimization terminated successfully.
Current function value: 0.675320
Iterations 4
print(sm_lgt.params)
[ 0.38442 -1.1429183]
And the estimates with sklearn:
sk_lgt = LogisticRegression(fit_intercept=False).fit(x, y)
print(sk_lgt.coef_)
[[ 0.16546794 -0.72637982]]
I think it has to do with the implementation in sklearn, which uses some sort of regularization. Is there an option to estimate a barebones logit as in statsmodels? (sklearn is substantially faster and scales much more nicely.) Also, does sklearn provide inference (standard errors) or marginal effects?
Is there an option to estimate a barebones logit as in statsmodels?
You can set the C (inverse regularization strength) parameter to an arbitrarily high constant, as long as it's finite:
>>> sk_lgt = LogisticRegression(fit_intercept=False, C=1e9).fit(x, y)
>>> print(sk_lgt.coef_)
[[ 0.38440594 -1.14287175]]
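A quick sanity check (assuming both fits above have been run in the same session) confirms that the two sets of estimates now agree to several decimal places:

>>> import numpy as np
>>> np.allclose(sk_lgt.coef_.ravel(), sm_lgt.params, atol=1e-4)
True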
Turning the regularization off entirely is not possible here, because that is not supported by the underlying solver, liblinear.
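On newer scikit-learn releases (this assumes version 1.2 or later, where penalty=None is accepted; older versions spelled it penalty='none'), you can also drop the penalty altogether by switching to a solver that supports unpenalized fits, such as lbfgs. A minimal sketch:

>>> sk_unreg = LogisticRegression(fit_intercept=False, penalty=None, solver='lbfgs')
>>> print(sk_unreg.fit(x, y).coef_)  # should closely match the statsmodels estimates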
Also, does sklearn provide inference (standard errors) or marginal effects?
No. There's a proposal to add this, but it's not in the master codebase yet.
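If you need standard errors with an sklearn fit anyway, one workaround (a hand-rolled sketch, not part of the sklearn API) is to compute the asymptotic standard errors from the observed information matrix of the logit log-likelihood, evaluated at the fitted coefficients. This assumes x already contains the constant column, and it is only meaningful when the regularization is effectively switched off:

import numpy as np
from scipy import stats

coef = sk_lgt.coef_.ravel()               # fitted coefficients
p = 1.0 / (1.0 + np.exp(-x.dot(coef)))    # fitted probabilities
W = p * (1.0 - p)                         # Bernoulli variances
info = (x * W[:, None]).T.dot(x)          # observed information matrix
cov = np.linalg.inv(info)                 # asymptotic covariance of coef
se = np.sqrt(np.diag(cov))                # standard errors
z = coef / se
p_values = 2 * stats.norm.sf(np.abs(z))
print(se, p_values)

In the meantime, statsmodels already exposes both: the fitted Logit results carry standard errors in sm_lgt.bse, and marginal effects are available via sm_lgt.get_margeff().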