I am trying to compare the logistic regression implementations in Python's statsmodels and in R.
Python version:
import statsmodels.api as sm
import pandas as pd
import pylab as pl
import numpy as np
# load the UCLA admissions data (admit, gre, gpa, rank)
df = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")
df.columns = list(df.columns)[:3] + ["prestige"]  # rename "rank" to "prestige"
# df.hist()
# pl.show()
# one 0/1 dummy column per prestige level
dummy_ranks = pd.get_dummies(df["prestige"], prefix="prestige", dtype=float)
cols_to_keep = ["admit", "gre", "gpa"]
# keep prestige_2..prestige_4 so that prestige_1 is the baseline
data = df[cols_to_keep].join(dummy_ranks.loc[:, "prestige_2":])
data["intercept"] = 1.0  # statsmodels does not add an intercept automatically
train_cols = data.columns[1:]
logit = sm.Logit(data["admit"], data[train_cols])
result = logit.fit()
result.summary2()
Result:
Results: Logit
=================================================================
Model: Logit Pseudo R-squared: 0.083
Dependent Variable: admit AIC: 470.5175
Date: 2014-12-19 01:11 BIC: 494.4663
No. Observations: 400 Log-Likelihood: -229.26
Df Model: 5 LL-Null: -249.99
Df Residuals: 394 LLR p-value: 7.5782e-08
Converged: 1.0000 Scale: 1.0000
No. Iterations: 6.0000
------------------------------------------------------------------
Coef. Std.Err. z P>|z| [0.025 0.975]
------------------------------------------------------------------
gre 0.0023 0.0011 2.0699 0.0385 0.0001 0.0044
gpa 0.8040 0.3318 2.4231 0.0154 0.1537 1.4544
prestige_2 -0.6754 0.3165 -2.1342 0.0328 -1.2958 -0.0551
prestige_3 -1.3402 0.3453 -3.8812 0.0001 -2.0170 -0.6634
prestige_4 -1.5515 0.4178 -3.7131 0.0002 -2.3704 -0.7325
intercept -3.9900 1.1400 -3.5001 0.0005 -6.2242 -1.7557
=================================================================
R version:
data = read.csv("http://www.ats.ucla.edu/stat/data/binary.csv", head=T)
require(reshape2)
data1 = dcast(data, admit + gre + gpa ~ rank)
require(dplyr)
names(data1)[4:7] = paste("rank", 1:4, sep="")
data1 = data1[, -4]
summary(glm(admit ~ gre + gpa + rank2 + rank3 + rank4, family=binomial, data=data1))
Result:
Call:
glm(formula = admit ~ gre + gpa + rank2 + rank3 + rank4, family = binomial,
data = data1)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5133 -0.8661 -0.6573 1.1808 2.0629
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.184029 1.162421 -3.599 0.000319 ***
gre 0.002358 0.001112 2.121 0.033954 *
gpa 0.770591 0.343908 2.241 0.025046 *
rank2 -0.369711 0.310342 -1.191 0.233535
rank3 -1.015012 0.335147 -3.029 0.002457 **
rank4 -1.249251 0.414416 -3.014 0.002574 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 466.13 on 377 degrees of freedom
Residual deviance: 434.12 on 372 degrees of freedom
AIC: 446.12
Number of Fisher Scoring iterations: 4
The results are quite different; for example, the p-values for rank_2 are 0.03 and 0.23, respectively. What is causing this difference? Note that I created dummy variables for both versions, as well as a constant column for the Python version, which R takes care of automatically.
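(For reference, here is a minimal sketch of the same model using statsmodels' formula interface, which, like R's glm, adds the intercept automatically and expands factors via C(), so no manual dummy or intercept columns are needed:)
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")

# C(rank) treats rank as a factor (rank 1 is the baseline) and the
# intercept is included by default, mirroring R's glm()
result = smf.logit("admit ~ gre + gpa + C(rank)", data=df).fit()
print(result.summary())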
Also, Python seems to be roughly 2x faster:
##################################################
# python timing
import time

def test():
    for i in range(5000):
        logit = sm.Logit(data["admit"], data[train_cols])
        result = logit.fit(disp=0)

start = time.time()
test()
print(time.time() - start)
10.099738836288452
##################################################
# R timing
> f = function() for(i in 1:5000) {mod = glm(admit ~ gre + gpa + rank2 + rank3 + rank4, family=binomial, data=data1)}
> system.time(f())
user system elapsed
17.505 0.021 17.526
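As a caveat, both loops time model construction as well as fitting (and the R loop also reparses the formula on every iteration). A minimal sketch that isolates the optimizer cost on the Python side, reusing the data and train_cols objects built above:
import time
import statsmodels.api as sm

# build the model once, then time only repeated fits
logit = sm.Logit(data["admit"], data[train_cols])
start = time.time()
for _ in range(5000):
    result = logit.fit(disp=0)
print(time.time() - start)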
While scikit-learn is slightly faster than statsmodels for 1,000 or fewer observations, that difference is not statistically significant per a t-test comparison; scikit-learn is significantly faster for datasets with more than 1,000 observations.
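For completeness, a minimal scikit-learn sketch of the same fit; note that sklearn's LogisticRegression applies L2 regularization by default, so a very large C is used here to approximate the unpenalized fits above (this sketch assumes the data and train_cols objects built earlier):
from sklearn.linear_model import LogisticRegression

# drop the manual intercept column: sklearn fits its own intercept
X = data[train_cols].drop(columns="intercept")
y = data["admit"]

# a huge C effectively disables the default L2 penalty, so the
# coefficients are comparable to statsmodels' Logit and R's glm
clf = LogisticRegression(C=1e9, max_iter=1000)
clf.fit(X, y)
print(clf.intercept_, clf.coef_)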
Statsmodels provides the Logit() class for performing logistic regression. Logit() accepts y and X as parameters and returns a Logit model object; the model is then fitted to the data with fit().
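For example, a minimal sketch reusing the data and train_cols objects from the question:
import statsmodels.api as sm

# Logit(endog, exog): the binary response first, then the design matrix
logit = sm.Logit(data["admit"], data[train_cols])
result = logit.fit()           # maximum likelihood (Newton's method by default)
print(result.params)           # fitted coefficients
print(result.pvalues)          # Wald z-test p-values
print(result.predict()[:5])    # in-sample fitted probabilities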
Statsmodels also has a syntax much closer to R's, so for those transitioning from R to Python it is a good choice. As expected for something coming from the statistics world, the emphasis is on understanding the relevant variables and effect sizes rather than just finding the model with the best fit.
Statsmodels is a Python package that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive set of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and for each estimator.
I'm not sure what your data manipulations are intended to do, but they are losing information in the R run: dcast collapses the data to one row per unique (admit, gre, gpa) combination, which is why your glm output reports the null deviance on only 377 degrees of freedom instead of 399. If I keep all of the rank information and fit on the original data object, I get the output below, and the results look very similar to the Python output everywhere they overlap. (Likelihoods are only defined up to an arbitrary constant, so strictly speaking you can only compare differences in log-likelihoods; even with that caveat, the deviance is twice the negative log-likelihood, so those results are comparable as well.)
> summary(glm(admit ~ gre + gpa + as.factor(rank), family=binomial,
          data=data))  # notice that I'm using your original data object
Call:
glm(formula = admit ~ gre + gpa + as.factor(rank), family = binomial,
data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6268 -0.8662 -0.6388 1.1490 2.0790
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.989979 1.139951 -3.500 0.000465 ***
gre 0.002264 0.001094 2.070 0.038465 *
gpa 0.804038 0.331819 2.423 0.015388 *
as.factor(rank)2 -0.675443 0.316490 -2.134 0.032829 *
as.factor(rank)3 -1.340204 0.345306 -3.881 0.000104 ***
as.factor(rank)4 -1.551464 0.417832 -3.713 0.000205 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 499.98 on 399 degrees of freedom
Residual deviance: 458.52 on 394 degrees of freedom
AIC: 470.52
Number of Fisher Scoring iterations: 4
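One quick sanity check on that comparability, using only the numbers already printed above (statsmodels' log-likelihoods and glm's deviances):
# deviance = -2 * log-likelihood, so the two outputs line up directly
print(-2 * -229.26)  # 458.52 -> glm's residual deviance
print(-2 * -249.99)  # 499.98 -> glm's null deviance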