Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why are the logistic regression results different between statsmodels and R?

I am trying to compare the logistic regression implementations in python's statsmodels and R.

Python version:

import statsmodels.api as sm
import pandas as pd
import pylab as pl
import numpy as np
df = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")
df.columns = list(df.columns)[:3] + ["prestige"]
# df.hist()
# pl.show()
dummy_ranks = pd.get_dummies(df["prestige"], prefix="prestige")
cols_to_keep = ["admit", "gre", "gpa"]
data = df[cols_to_keep].join(dummy_ranks.ix[:, "prestige_2":])
data["intercept"] = 1.0
train_cols = data.columns[1:]
logit = sm.Logit(data["admit"], data[train_cols])
result = logit.fit()
result.summary2()

Result:

                         Results: Logit
=================================================================
Model:              Logit            Pseudo R-squared: 0.083     
Dependent Variable: admit            AIC:              470.5175  
Date:               2014-12-19 01:11 BIC:              494.4663  
No. Observations:   400              Log-Likelihood:   -229.26   
Df Model:           5                LL-Null:          -249.99   
Df Residuals:       394              LLR p-value:      7.5782e-08
Converged:          1.0000           Scale:            1.0000    
No. Iterations:     6.0000                                       
------------------------------------------------------------------
               Coef.   Std.Err.     z     P>|z|    [0.025   0.975]
------------------------------------------------------------------
gre            0.0023    0.0011   2.0699  0.0385   0.0001   0.0044
gpa            0.8040    0.3318   2.4231  0.0154   0.1537   1.4544
prestige_2    -0.6754    0.3165  -2.1342  0.0328  -1.2958  -0.0551
prestige_3    -1.3402    0.3453  -3.8812  0.0001  -2.0170  -0.6634
prestige_4    -1.5515    0.4178  -3.7131  0.0002  -2.3704  -0.7325
intercept     -3.9900    1.1400  -3.5001  0.0005  -6.2242  -1.7557
=================================================================

R version:

data = read.csv("http://www.ats.ucla.edu/stat/data/binary.csv", head=T)
require(reshape2)
data1 = dcast(data, admit + gre + gpa ~ rank)
require(dplyr)
names(data1)[4:7] = paste("rank", 1:4, sep="")
data1 = data1[, -4]
summary(glm(admit ~ gre + gpa + rank2 + rank3 + rank4, family=binomial, data=data1))

Result:

Call:
glm(formula = admit ~ gre + gpa + rank2 + rank3 + rank4, family = binomial,
    data = data1)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.5133  -0.8661  -0.6573   1.1808   2.0629

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.184029   1.162421  -3.599 0.000319 ***
gre          0.002358   0.001112   2.121 0.033954 *
gpa          0.770591   0.343908   2.241 0.025046 *
rank2       -0.369711   0.310342  -1.191 0.233535
rank3       -1.015012   0.335147  -3.029 0.002457 **
rank4       -1.249251   0.414416  -3.014 0.002574 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 466.13  on 377  degrees of freedom
Residual deviance: 434.12  on 372  degrees of freedom
AIC: 446.12

Number of Fisher Scoring iterations: 4

The results are quite different, for example, the p-values for rank_2 are 0.03 and 0.2 respectively. I am wondering what are causes of this difference? Note that I have created dummy variables for both versions, and a constant column for the python version, which is automatically taken care of in R.

Also, it seems python is 2x faster:

##################################################
# python timing
def test():
    for i in range(5000):
        logit = sm.Logit(data["admit"], data[train_cols])
        result = logit.fit(disp=0)
import time
start = time.time()
test()
print(time.time() - start)
10.099738836288452
##################################################
# R timing
> f = function() for(i in 1:5000) {mod = glm(admit ~ gre + gpa + rank2 + rank3 + rank4, family=binomial, data=data1)}
> system.time(f())
   user  system elapsed
 17.505   0.021  17.526
like image 877
qed Avatar asked Dec 19 '14 00:12

qed


People also ask

Is statsmodels better than Sklearn?

While scikit-learn is slightly faster than statsmodels for 1,000 or less observations, this difference is not significant per the t-test analysis. Sci-kit learn is significantly faster for datasets with more than 1,000 observations.

What is statsmodels logit?

Statsmodels provides a Logit() function for performing logistic regression. The Logit() function accepts y and X as parameters and returns the Logit object. The model is then fitted to the data.

Is Python statsmodels good?

It also has a syntax much closer to R so, for those who are transitioning to Python, StatsModels is a good choice. As expected for something coming from the statistics world, there's an emphasis on understanding the relevant variables and effect size, compared to just finding the model with the best fit.

Why do we use statsmodels in Python?

Statsmodels is a Python package that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics are available for different types of data and each estimator.


1 Answers

Not sure what your data manipulations are intending but they seem to be loosing information in the R run. If I keep all the rank information in, then I get this on the original data-object (and the results look very similar in the areas they overlap on. (Likelihoods are only estimated up to an arbitrary constant so you can only compare differences in log-likelihood. Even with that caveat the deviance is supposed to be twice the negative log-likelihood so those results are also comparable.)

> summary(glm(admit ~ gre + gpa +as.factor( rank), family=binomial,
       data=data))  # notice that I'm using your original data-object

Call:
glm(formula = admit ~ gre + gpa + as.factor(rank), family = binomial, 
    data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6268  -0.8662  -0.6388   1.1490   2.0790  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -3.989979   1.139951  -3.500 0.000465 ***
gre               0.002264   0.001094   2.070 0.038465 *  
gpa               0.804038   0.331819   2.423 0.015388 *  
as.factor(rank)2 -0.675443   0.316490  -2.134 0.032829 *  
as.factor(rank)3 -1.340204   0.345306  -3.881 0.000104 ***
as.factor(rank)4 -1.551464   0.417832  -3.713 0.000205 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 458.52  on 394  degrees of freedom
AIC: 470.52

Number of Fisher Scoring iterations: 4
like image 94
IRTFM Avatar answered Sep 21 '22 12:09

IRTFM