When running a logistic regression, the coefficients I get using statsmodels are correct (verified them with some course material). However, I am unable to get the same coefficients with sklearn. I've tried preprocessing the data to no avail. This is my code:
Statsmodels:
import statsmodels.api as sm
X_const = sm.add_constant(X)  # prepend an intercept column
model = sm.Logit(y, X_const)
results = model.fit()
print(results.summary())
The relevant output is:
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.2382      3.983     -0.060      0.952      -8.045       7.569
a              2.0349      0.837      2.430      0.015       0.393       3.676
b              0.8077      0.823      0.981      0.327      -0.806       2.421
c              1.4572      0.768      1.897      0.058      -0.049       2.963
d             -0.0522      0.063     -0.828      0.407      -0.176       0.071
e_2            0.9157      1.082      0.846      0.397      -1.205       3.037
e_3            2.0080      1.052      1.909      0.056      -0.054       4.070
Scikit-learn (no preprocessing):
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
results = model.fit(X, y)
print(results.coef_)
print(results.intercept_)
The coefficients given are:
array([[ 1.29779008, 0.56524976, 0.97268593, -0.03762884, 0.33646097,
0.98020901]])
And the intercept/constant given is:
array([ 0.0949539])
As you can see, regardless of which coefficient corresponds to which variable, the numbers given by sklearn don't match the correct ones from statsmodels. What am I missing? Thanks in advance!
Scikit-learn and pandas see more active development than statsmodels, and scikit-learn is often the first choice simply because its API makes fitting a model easy and clear.
In its basic form, the regression model is the same in statsmodels and in scikit-learn. However, the implementations differ, which can produce different results in edge cases, and scikit-learn generally has more support for larger models; for example, statsmodels currently uses sparse matrices in very few parts.
The solvers implemented in the LogisticRegression class are "liblinear", "newton-cg", "lbfgs", "sag" and "saga". In a nutshell, the "saga" solver is often the best choice, while the "liblinear" solver is used by default for historical reasons.
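The solver can also be chosen explicitly when the estimator is constructed. A minimal sketch, assuming the same X and y as in the question; solver="lbfgs" and max_iter=1000 are illustrative choices, not requirements:
from sklearn.linear_model import LogisticRegression

# Pick the solver explicitly; "lbfgs" handles the dense, L2-penalised
# problem used in this question.
model = LogisticRegression(solver="lbfgs", max_iter=1000)
model.fit(X, y)
print(model.coef_, model.intercept_)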
Statsmodels provides the Logit() class for performing logistic regression. Logit() accepts y and X as parameters and returns a Logit object; the model is then fitted to the data.
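If you want the coefficients as plain numbers rather than the summary table, the fitted results object exposes them directly. A small sketch using the results object produced by the question's code:
# The fitted coefficients, with the constant first because add_constant
# prepends the intercept column.
print(results.params)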
Thanks to a kind soul on reddit, this was solved. To get the same coefficients, one has to effectively disable the regularisation that sklearn applies to logistic regression by default:
model = LogisticRegression(C=1e8)
Where C, according to the documentation, is:
C : float, default: 1.0
Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
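Putting it together, a sketch of the side-by-side comparison, assuming the same X and y as in the question (the names sm_results and sk_model are just illustrative):
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

# Unregularised fit with statsmodels
sm_results = sm.Logit(y, sm.add_constant(X)).fit(disp=0)

# Effectively unregularised fit with sklearn: a very large C makes the
# default L2 penalty negligible
sk_model = LogisticRegression(C=1e8, max_iter=1000).fit(X, y)

# Both printed in the order [const, a, b, c, d, e_2, e_3]
print(np.asarray(sm_results.params))
print(np.r_[sk_model.intercept_, sk_model.coef_[0]])
On newer scikit-learn versions the large-C trick can be replaced by passing penalty=None (the string 'none' in releases before 1.2), which removes the penalty term entirely.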