I am trying to compare the logistic regression implementations in Python's statsmodels and in R.
Python version:
import statsmodels.api as sm
import pandas as pd
import pylab as pl
import numpy as np
# load the UCLA admissions data (admit, gre, gpa, rank)
df = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")
df.columns = list(df.columns)[:3] + ["prestige"]  # rename "rank" to "prestige"
# df.hist()
# pl.show()
# one 0/1 dummy column per prestige level
dummy_ranks = pd.get_dummies(df["prestige"], prefix="prestige", dtype=float)
cols_to_keep = ["admit", "gre", "gpa"]
# keep prestige_2..prestige_4 so that prestige_1 is the baseline
data = df[cols_to_keep].join(dummy_ranks.loc[:, "prestige_2":])
data["intercept"] = 1.0  # statsmodels does not add an intercept automatically
train_cols = data.columns[1:]
logit = sm.Logit(data["admit"], data[train_cols])
result = logit.fit()
result.summary2()
Result:
Results: Logit
=================================================================
Model: Logit Pseudo R-squared: 0.083
Dependent Variable: admit AIC: 470.5175
Date: 2014-12-19 01:11 BIC: 494.4663
No. Observations: 400 Log-Likelihood: -229.26
Df Model: 5 LL-Null: -249.99
Df Residuals: 394 LLR p-value: 7.5782e-08
Converged: 1.0000 Scale: 1.0000
No. Iterations: 6.0000
------------------------------------------------------------------
Coef. Std.Err. z P>|z| [0.025 0.975]
------------------------------------------------------------------
gre 0.0023 0.0011 2.0699 0.0385 0.0001 0.0044
gpa 0.8040 0.3318 2.4231 0.0154 0.1537 1.4544
prestige_2 -0.6754 0.3165 -2.1342 0.0328 -1.2958 -0.0551
prestige_3 -1.3402 0.3453 -3.8812 0.0001 -2.0170 -0.6634
prestige_4 -1.5515 0.4178 -3.7131 0.0002 -2.3704 -0.7325
intercept -3.9900 1.1400 -3.5001 0.0005 -6.2242 -1.7557
=================================================================
R version:
data = read.csv("http://www.ats.ucla.edu/stat/data/binary.csv", head=T)
require(reshape2)
data1 = dcast(data, admit + gre + gpa ~ rank)
require(dplyr)
names(data1)[4:7] = paste("rank", 1:4, sep="")
data1 = data1[, -4]
summary(glm(admit ~ gre + gpa + rank2 + rank3 + rank4, family=binomial, data=data1))
Result:
Call:
glm(formula = admit ~ gre + gpa + rank2 + rank3 + rank4, family = binomial,
data = data1)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5133 -0.8661 -0.6573 1.1808 2.0629
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.184029 1.162421 -3.599 0.000319 ***
gre 0.002358 0.001112 2.121 0.033954 *
gpa 0.770591 0.343908 2.241 0.025046 *
rank2 -0.369711 0.310342 -1.191 0.233535
rank3 -1.015012 0.335147 -3.029 0.002457 **
rank4 -1.249251 0.414416 -3.014 0.002574 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 466.13 on 377 degrees of freedom
Residual deviance: 434.12 on 372 degrees of freedom
AIC: 446.12
Number of Fisher Scoring iterations: 4
The results are quite different; for example, the p-values for rank_2 are 0.03 and 0.23, respectively. What is causing this difference? Note that I created dummy variables for both versions, as well as a constant column for the Python version, which R takes care of automatically.
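(For reference, here is a minimal sketch of the same model using statsmodels' formula interface, which, like R's glm, adds the intercept automatically and expands factors via C(), so no manual dummy or intercept columns are needed:)
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")

# C(rank) treats rank as a factor (rank 1 is the baseline) and the
# intercept is included by default, mirroring R's glm()
result = smf.logit("admit ~ gre + gpa + C(rank)", data=df).fit()
print(result.summary())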
Also, Python seems to be roughly 2x faster:
##################################################
# python timing
import time

def test():
    for i in range(5000):
        logit = sm.Logit(data["admit"], data[train_cols])
        result = logit.fit(disp=0)

start = time.time()
test()
print(time.time() - start)
10.099738836288452
##################################################
# R timing
> f = function() for(i in 1:5000) {mod = glm(admit ~ gre + gpa + rank2 + rank3 + rank4, family=binomial, data=data1)}
> system.time(f())
user system elapsed
17.505 0.021 17.526
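As a caveat, both loops time model construction as well as fitting (and the R loop also reparses the formula on every iteration). A minimal sketch that isolates the optimizer cost on the Python side, reusing the data and train_cols objects built above:
import time
import statsmodels.api as sm

# build the model once, then time only repeated fits
logit = sm.Logit(data["admit"], data[train_cols])
start = time.time()
for _ in range(5000):
    result = logit.fit(disp=0)
print(time.time() - start)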
While scikit-learn is slightly faster than statsmodels for 1,000 or fewer observations, that difference is not statistically significant per a t-test comparison; scikit-learn is significantly faster for datasets with more than 1,000 observations.
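For completeness, a minimal scikit-learn sketch of the same fit; note that sklearn's LogisticRegression applies L2 regularization by default, so a very large C is used here to approximate the unpenalized fits above (this sketch assumes the data and train_cols objects built earlier):
from sklearn.linear_model import LogisticRegression

# drop the manual intercept column: sklearn fits its own intercept
X = data[train_cols].drop(columns="intercept")
y = data["admit"]

# a huge C effectively disables the default L2 penalty, so the
# coefficients are comparable to statsmodels' Logit and R's glm
clf = LogisticRegression(C=1e9, max_iter=1000)
clf.fit(X, y)
print(clf.intercept_, clf.coef_)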
Statsmodels provides the Logit() class for performing logistic regression. Logit() accepts y and X as parameters and returns a Logit model object; the model is then fitted to the data with fit().
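For example, a minimal sketch reusing the data and train_cols objects from the question:
import statsmodels.api as sm

# Logit(endog, exog): the binary response first, then the design matrix
logit = sm.Logit(data["admit"], data[train_cols])
result = logit.fit()           # maximum likelihood (Newton's method by default)
print(result.params)           # fitted coefficients
print(result.pvalues)          # Wald z-test p-values
print(result.predict()[:5])    # in-sample fitted probabilities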
Statsmodels also has a syntax much closer to R's, so for those transitioning from R to Python it is a good choice. As expected for something coming from the statistics world, the emphasis is on understanding the relevant variables and effect sizes rather than just finding the model with the best fit.
Statsmodels is a Python package that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive set of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and for each estimator.
I'm not sure what your data manipulations are intended to do, but they are losing information in the R run: dcast collapses the data to one row per unique (admit, gre, gpa) combination, which is why your glm output reports the null deviance on only 377 degrees of freedom instead of 399. If I keep all of the rank information and fit on the original data object, I get the output below, and the results look very similar to the Python output everywhere they overlap. (Likelihoods are only defined up to an arbitrary constant, so strictly speaking you can only compare differences in log-likelihoods; even with that caveat, the deviance is twice the negative log-likelihood, so those results are comparable as well.)
> summary(glm(admit ~ gre + gpa + as.factor(rank), family=binomial,
          data=data))  # notice that I'm using your original data object
Call:
glm(formula = admit ~ gre + gpa + as.factor(rank), family = binomial,
data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6268 -0.8662 -0.6388 1.1490 2.0790
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.989979 1.139951 -3.500 0.000465 ***
gre 0.002264 0.001094 2.070 0.038465 *
gpa 0.804038 0.331819 2.423 0.015388 *
as.factor(rank)2 -0.675443 0.316490 -2.134 0.032829 *
as.factor(rank)3 -1.340204 0.345306 -3.881 0.000104 ***
as.factor(rank)4 -1.551464 0.417832 -3.713 0.000205 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 499.98 on 399 degrees of freedom
Residual deviance: 458.52 on 394 degrees of freedom
AIC: 470.52
Number of Fisher Scoring iterations: 4
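One quick sanity check on that comparability, using only the numbers already printed above (statsmodels' log-likelihoods and glm's deviances):
# deviance = -2 * log-likelihood, so the two outputs line up directly
print(-2 * -229.26)  # 458.52 -> glm's residual deviance
print(-2 * -249.99)  # 499.98 -> glm's null deviance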