Comparison of R, statsmodels, sklearn for a classification task with logistic regression

I have run some experiments with logistic regression in R, Python's statsmodels, and sklearn. While the results given by R and statsmodels agree, there is some discrepancy with what is returned by sklearn. I would like to understand why these results are different. I understand that it is probably not the same optimization algorithm that is used under the hood.

Specifically, I use the standard Default dataset (used in the ISL book). The following Python code reads the data into a dataframe Default.

import pandas as pd

# data is available here
Default = pd.read_csv('https://d1pqsl2386xqi9.cloudfront.net/notebooks/Default.csv', index_col=0)

Default['default'] = Default['default'].map({'No':0, 'Yes':1})
Default['student'] = Default['student'].map({'No':0, 'Yes':1})

I = Default['default'] == 0
print("Number of 'default' values :", Default[~I]['balance'].count())

Number of 'default' values : 333

There are 10,000 examples in total, of which only 333 are positives.

Logistic regression in R

I use the following code:

library("ISLR")
data(Default,package='ISLR')
 #write.csv(Default,"default.csv")
glm.out=glm('default~balance+income+student', family=binomial, data=Default)
s=summary(glm.out)
print(s)
#
glm.probs=predict(glm.out,type="response") 
glm.probs[1:5]
glm.pred=ifelse(glm.probs>0.5,"Yes","No")
 #attach(Default)
t=table(glm.pred,Default$default)
print(t)
score=mean(glm.pred==Default$default)
print(paste("score",score))

The result is as follows

Call: glm(formula = "default~balance+income+student", family = binomial, data = Default)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.4691  -0.1418  -0.0557  -0.0203   3.7383

Coefficients:

              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16
balance      5.737e-03  2.319e-04  24.738  < 2e-16
income       3.033e-06  8.203e-06   0.370  0.71152
studentYes  -6.468e-01  2.363e-01  -2.738  0.00619

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 1571.5  on 9996  degrees of freedom
AIC: 1579.5

Number of Fisher Scoring iterations: 8

glm.pred   No  Yes
     No  9627  228
     Yes   40  105

[1] "score 0.9732"

I am too lazy to cut and paste the results obtained with statsmodels. It suffices to say that they are extremely similar to those given by R.
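For the record, a minimal sketch of the equivalent statsmodels fit (not necessarily the exact code I used; it assumes the Default dataframe prepared above) would be:

import statsmodels.formula.api as smf

# same model as the R formula: default ~ balance + income + student
logit_result = smf.logit('default ~ balance + income + student', data=Default).fit()
print(logit_result.summary())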

sklearn

For sklearn, I ran the code shown below. Two remarks:

  • There is a parameter class_weight for taking into account unbalanced classes. I tested class_weight=None (no weighting -- I think that is the default in R) and class_weight='auto' (weighting with the inverse of the frequencies found in the data).
  • I also set C=10000, the inverse of the regularization strength, so as to minimize the effect of regularization.


import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

features = Default[[ 'balance', 'income' ]]
target = Default['default']
# 
for weight in (None,  'auto'):
    print("*"*40+"\nweight:",weight)

    # C=10000 ~ no regularization
    classifier = LogisticRegression(C=10000, class_weight=weight, random_state=42)

    classifier.fit(features, target)  # fit classifier on the whole dataset
    print("Intercept", classifier.intercept_)
    print("Coefficients", classifier.coef_)

    y_true=target
    y_pred_cls=classifier.predict_proba(features)[:,1]>0.5
    C=confusion_matrix(y_true,y_pred_cls)

    score = (C[0,0]+C[1,1])/(C[0,0]+C[1,1]+C[0,1]+C[1,0])
    precision = C[1,1]/(C[1,1]+C[0,1])
    recall = C[1,1]/(C[1,1]+C[1,0])
    print("\n Confusion matrix")
    print(C)
    print()
    print('{s:{c}<{n}}{num:2.4}'.format(s='Score',n=15,c='', num=score))
    print('{s:{c}<{n}}{num:2.4}'.format(s='Precision',n=15,c='', num=precision))
    print('{s:{c}<{n}}{num:2.4}'.format(s='Recall',n=15,c='', num=recall))

The results are given below.

> **************************************** 
>weight: None 
>
>Intercept [ -1.94164126e-06] 
>
>Coefficients [[ 0.00040756 -0.00012588]]
> 
>  Confusion matrix 
>
>     [[9664    3]  
>     [ 333    0]]
> 
>     Score          0.9664 
>     Precision      0.0 
>     Recall         0.0
>
> **************************************** 
>weight: auto 
>
>Intercept [-8.15376429] 
>
>Coefficients 
>[[  5.67564834e-03   1.95253338e-05]]
> 
>  Confusion matrix 
>
>     [[8356 1311]  
>     [  34  299]]
> 
>     Score          0.8655 
>     Precision      0.1857 
>     Recall         0.8979

What I observe is that for class_weight=None, the score is excellent but no positive example is recognized: precision and recall are both at zero. The coefficients found are very small, particularly the intercept, and modifying C does not change things. For class_weight='auto' things seem better, but the precision is still very low (too many examples are classified as positive). Again, changing C does not help. If I modify the intercept by hand, I can recover the results given by R. So I suspect that there is a discrepancy between the estimation of the intercepts in the two cases. Since this affects the specification of the threshold (analogous to a resampling of the populations), it can explain the differences in performance.
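Concretely, "modifying the intercept by hand" means something like the sketch below, run after the class_weight='auto' fit above. The value plugged in is only indicative (it is the intercept reported by R for the full balance+income+student model, whereas the sklearn model here omits student):

import numpy as np

# keep the fitted slopes but overwrite the intercept with a value close to R's,
# then re-score on the same data
classifier.intercept_ = np.array([-10.87])   # indicative value only
y_pred_cls = classifier.predict_proba(features)[:, 1] > 0.5
print(confusion_matrix(target, y_pred_cls))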

However, I would welcome any advice on choosing between the two solutions, and any help in understanding the origin of these differences. Thanks.

asked Feb 26 '15 by jfb




2 Answers

I ran into a similar issue and ended up posting about it on /r/MachineLearning. It turns out the difference can be attributed to data standardization. Whatever approach scikit-learn is using to find the parameters of the model will yield better results if the data is standardized. scikit-learn has some documentation discussing preprocessing data (including standardization), which can be found here.

Results

Number of 'default' values : 333
Intercept: [-6.12556565]
Coefficients: [[ 2.73145133  0.27750788]]

Confusion matrix
[[9629   38]
 [ 225  108]]

Score          0.9737
Precision      0.7397
Recall         0.3243

Code

# scikit-learn vs. R
# http://stackoverflow.com/questions/28747019/comparison-of-r-statmodels-sklearn-for-a-classification-task-with-logistic-reg

import pandas as pd
import sklearn

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn import preprocessing

# Data is available here.
Default = pd.read_csv('https://d1pqsl2386xqi9.cloudfront.net/notebooks/Default.csv', index_col = 0)

Default['default'] = Default['default'].map({'No':0, 'Yes':1})
Default['student'] = Default['student'].map({'No':0, 'Yes':1})

I = Default['default'] == 0
print("Number of 'default' values : {0}".format(Default[~I]['balance'].count()))

feats = ['balance', 'income']

Default[feats] = preprocessing.scale(Default[feats])

# C = 1e6 ~ no regularization.
classifier = LogisticRegression(C = 1e6, random_state = 42) 

classifier.fit(Default[feats], Default['default'])  #fit classifier on whole base
print("Intercept: {0}".format(classifier.intercept_))
print("Coefficients: {0}".format(classifier.coef_))

y_true = Default['default']
y_pred_cls = classifier.predict_proba(Default[feats])[:,1] > 0.5

confusion = confusion_matrix(y_true, y_pred_cls)
score = float((confusion[0, 0] + confusion[1, 1])) / float((confusion[0, 0] + confusion[1, 1] + confusion[0, 1] + confusion[1, 0]))
precision = float((confusion[1, 1])) / float((confusion[1, 1] + confusion[0, 1]))
recall = float((confusion[1, 1])) / float((confusion[1, 1] + confusion[1, 0]))
print("\nConfusion matrix")
print(confusion)
print('\n{s:{c}<{n}}{num:2.4}'.format(s = 'Score', n = 15, c = '', num = score))
print('{s:{c}<{n}}{num:2.4}'.format(s = 'Precision', n = 15, c = '', num = precision))
print('{s:{c}<{n}}{num:2.4}'.format(s = 'Recall', n = 15, c = '', num = recall))
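Since these coefficients are on the standardized scale, they are not directly comparable with R's estimates, which were obtained on the raw features. A minimal sketch of how one might map them back to the original units, assuming the classifier and feats defined above are still in scope (the raw data is reloaded because Default[feats] was overwritten by preprocessing.scale):

import numpy as np

raw = pd.read_csv('https://d1pqsl2386xqi9.cloudfront.net/notebooks/Default.csv', index_col = 0)
mu = raw[feats].mean().values
sigma = raw[feats].std(ddof = 0).values  # preprocessing.scale uses the population standard deviation

# undo the standardization: beta_orig = beta_scaled / sigma,
# intercept_orig = intercept_scaled - sum(beta_scaled * mu / sigma)
beta_scaled = classifier.coef_.ravel()
print("Coefficients (original scale): {0}".format(beta_scaled / sigma))
print("Intercept (original scale): {0}".format(classifier.intercept_[0] - np.sum(beta_scaled * mu / sigma)))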
answered by airalcorn2


Although this post is old, I wanted to give you a solution. In your post you are comparing apples with oranges. In your R code, you are regressing "default" on "balance", "income", and "student". In your Python code, you are regressing "default" on "balance" and "income" only. Of course, you cannot get the same estimates. Moreover, the differences cannot be attributed to feature scaling, since logistic regression, unlike k-means, does not usually require it.

You are right to set a high C, so that there is effectively no regularization. If you want to get the same output as in R, you have to change the solver to "newton-cg". Different solvers can give different results, but they still yield the same objective value; as long as your solver converges, everything will be okay.

Here's code that gives you the same estimates as R and statsmodels:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from patsy import dmatrices
import numpy as np

# data is available here
Default = pd.read_csv('https://d1pqsl2386xqi9.cloudfront.net/notebooks/Default.csv', index_col=0)

Default['default'] = Default['default'].map({'No':0, 'Yes':1})
Default['student'] = Default['student'].map({'No':0, 'Yes':1})

# use dmatrices to get data frame for logistic regression
y, X = dmatrices('default ~ balance+income+C(student)',
                  Default,return_type="dataframe")

y = np.ravel(y)

# fit logistic regression
model = LogisticRegression(C = 1e6, fit_intercept=False, solver = "newton-cg", max_iter=10000000)
model = model.fit(X, y)

# examine the coefficients (wrap zip in list() so this also works under Python 3)
print(pd.DataFrame(list(zip(X.columns, np.transpose(model.coef_)))))
answered by Dat Tran