Multinomial Logit model Python and Stata different results

I am trying to build a multinomial logit model using Python and Stata. My data is as follows:

    ses_type prog_type  read  write  math  prog  ses 
0        low   Diploma  39.2   40.2  46.2     0     0
1     middle   general  39.2   38.2  46.2     1     1
2       high   Diploma  44.5   44.5  49.5     0     2
3        low   Diploma  43.0   43.0  48.0     0     0
4     middle   Diploma  44.5   36.5  45.5     0     1
5       high   general  47.3   41.3  47.3     1     2

I am trying to predict prog using ses, read, write and math, where ses represents socioeconomic status and is a nominal variable. Therefore I created my model in Stata using the following command:

mlogit prog i.ses read write math, base(2)

Stata output is as follows:

Iteration 0:   log likelihood = -204.09667  
Iteration 1:   log likelihood = -171.90258  
Iteration 2:   log likelihood = -170.13513  
Iteration 3:   log likelihood = -170.11071  
Iteration 4:   log likelihood =  -170.1107  

Multinomial logistic regression                 Number of obs     =        200
                                                LR chi2(10)       =      67.97
                                                Prob > chi2       =     0.0000
Log likelihood =  -170.1107                     Pseudo R2         =     0.1665

------------------------------------------------------------------------------
        prog |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
0            |
         ses |
          1  |   .6197969   .5059335     1.23   0.221    -.3718146    1.611408
          2  |  -.5131952   .6280601    -0.82   0.414     -1.74417    .7177799
             |
        read |  -.0405302   .0289314    -1.40   0.161    -.0972346    .0161742
       write |  -.0459711   .0270153    -1.70   0.089      -.09892    .0069779
        math |  -.0990497   .0331576    -2.99   0.003    -.1640373   -.0340621
       _cons |   9.544131   1.738404     5.49   0.000     6.136921    12.95134
-------------+----------------------------------------------------------------
1            |
         ses |
          1  |  -.3350861   .4607246    -0.73   0.467     -1.23809    .5679176
          2  |  -.8687013   .5363968    -1.62   0.105     -1.92002     .182617
             |
        read |  -.0226249   .0264534    -0.86   0.392    -.0744726    .0292228
       write |   -.011618   .0266782    -0.44   0.663    -.0639063    .0406703
        math |  -.0591301   .0299996    -1.97   0.049    -.1179283    -.000332
       _cons |   5.041193   1.524174     3.31   0.001     2.053866    8.028519
-------------+----------------------------------------------------------------
2            |  (base outcome)
------------------------------------------------------------------------------

I tried to replicate the same results using the scikit-learn module in Python. Following is the code:

import pandas as pd
import numpy as np
from sklearn import linear_model

data = pd.read_csv("C://Users/Furqan/Desktop/random_data.csv")


# note: the 'ses ' column name in the CSV has a trailing space
train_x = np.array(data[['read', 'write', 'math', 'ses ']])
train_y = np.array(data['prog'])

mul_lr = linear_model.LogisticRegression(multi_class='multinomial',
                                         solver='newton-cg').fit(train_x, train_y)

print(mul_lr.intercept_)
print(mul_lr.coef_)

The output values (intercept and coefficient) are as follows:

[ 4.76438772  0.19347405 -4.95786177]

[[-0.01735513 -0.02731273 -0.04463257  0.01721334]
 [-0.00319366  0.00783135 -0.00689664 -0.24480926]
 [ 0.02054879  0.01948137  0.05152921  0.22759592]]

The values turn out to be different.

My first question: why are the results different?

My second question: in the case of a nominal predictor variable, how can we instruct Python that ses is an indicator variable?

EDIT:

Link to data file

Asked Mar 03 '18 by Furqan Hashim


1 Answer

There are several issues that make Stata and sklearn results differ:

  1. Different actual predictors in Stata and sklearn
  2. Different representations of fitted parameters
  3. Different objective functions when fitting the model

We need to address all three to achieve similar outputs.

1. Making the dummy variables

The formula that Stata uses for the linear part is

 prediction = a0 + a1 * [ses==1] + a2 * [ses==2] + a3 * read + a4 * write + a5 * math

Sklearn, in turn, knows nothing about the categorical nature of ses, and tries to use

 prediction = a0 + a1 * ses + a3 * read + a4 * write + a5 * math

To enable categorical predictors, you need to preprocess the data yourself: this is the only way to include categorical variables in an sklearn logistic regression. I find pd.get_dummies() the most convenient way to do this.
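If you haven't used it before, here is a tiny standalone illustration (toy values, not the question's data) of what pd.get_dummies() produces and how dropping one level makes it the reference category:

import pandas as pd

# Toy example: one 0/1 indicator column per level of the categorical variable.
toy = pd.DataFrame({'ses': ['low', 'middle', 'high', 'low']})
dummies = pd.get_dummies(toy, columns=['ses'])
print(dummies.columns.tolist())   # ['ses_high', 'ses_low', 'ses_middle']

# Dropping one level (here 'low') makes it the reference category,
# which mirrors Stata's i.ses with ses==0 as the base level.
print(dummies.drop('ses_low', axis=1))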

Back to your data, the following code creates the dummy variables for ses and then drops the "low" level, which evidently corresponds to ses=0 in your example:

import pandas as pd, numpy as np
from sklearn import linear_model

data = pd.read_csv("d1.csv", sep='\t')
data.columns = data.columns.str.strip()

raw_x = data.drop('prog', axis=1)
# making the dummies
train_x = pd.get_dummies(raw_x, columns=['ses']).drop('ses_low ', axis=1)
print(train_x.columns)
train_y = data['prog']

mul_lr = linear_model.LogisticRegression(multi_class='multinomial',
                                         solver='newton-cg').fit(train_x, train_y)
reorder = [4, 3, 0, 1, 2] # the order in which the coefficients show up in Stata

print(mul_lr.intercept_)
print(mul_lr.coef_[:, reorder])

It outputs

['read', 'write', 'math', 'ses_high ', 'ses_middle ']
[ 4.67331919  0.19082335 -4.86414254]
[[ 0.47140512 -0.08236331 -0.01909793 -0.02680609 -0.04587383]
 [-0.36381476 -0.33294749 -0.0021255   0.00765828 -0.00703075]
 [-0.10759035  0.4153108   0.02122343  0.01914781  0.05290458]]

You see that Python has successfully encoded ses into 'ses_high ' and 'ses_middle ', but failed to produce the expected coefficients.

By the way, I have reordered the coef_ columns in the output so that they line up with the Stata table.

2. Rearranging the results

The remaining discrepancy is because Stata treats the third category of the outcome (prog=='honors ') as the base outcome and reports the other categories' parameters relative to it. This is harmless, since multinomial logit parameters are only identified up to a common shift across outcomes: you can subtract the base outcome's parameters from all of them without changing the model. In Python, you can reproduce this by running

print(mul_lr.intercept_ - mul_lr.intercept_[-1])
print((mul_lr.coef_  - mul_lr.coef_[-1])[:, reorder])

which gives you

[9.53746174 5.0549659  0.        ]
[[ 0.57899547 -0.4976741  -0.04032136 -0.0459539  -0.09877841]
 [-0.25622441 -0.74825829 -0.02334893 -0.01148954 -0.05993533]
 [ 0.          0.          0.          0.          0.        ]]
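As a quick sanity check on that shift-invariance (a minimal sketch with toy numbers, not the question's data):

import numpy as np

# Softmax probabilities are unchanged when the same constant is subtracted from
# every class score, so subtracting the base outcome's parameters does not
# change the fitted model.
z = np.array([2.0, 1.0, -0.5])
softmax = lambda v: np.exp(v) / np.exp(v).sum()
print(np.allclose(softmax(z), softmax(z - z[-1])))  # True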

You can see that the rearranged parameters are now close to what Stata gives:

  • intercepts of (9.53, 5.05) in Python vs (9.54, 5.04) in Stata
  • first-outcome coefficients (0.57, -0.49, ...) vs (0.61, -0.51, ...)
  • second-outcome coefficients (-0.25, -0.74, ...) vs (-0.33, -0.86, ...)

Can you see the pattern? In sklearn, slope coefficients are smaller (closer to zero) than in Stata. This is not an accident!

3. Dealing with regularization

This happens because sklearn intentionally shrinks the slope coefficients towards 0 by adding a quadratic penalty on the coefficients to the negative log-likelihood it minimizes. This makes the estimates biased but more stable, even in the case of severe multicollinearity. In Bayesian terms, such regularization corresponds to a zero-mean Gaussian prior on all the coefficients. You can learn more about regularization on Wikipedia.
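Roughly speaking, the function sklearn minimizes looks like this (a sketch of the default L2-penalized objective, not a quote from the sklearn docs; W collects all the slope coefficients, and the intercepts b are not penalized):

 0.5 * ||W||^2  +  C * sum_i cross_entropy(y_i, softmax(W * x_i + b))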

In sklearn, this quadratic penalty is controlled by the positive C parameter: the smaller it is, the more regularization you get. You can think of it as the prior variance of each slope coefficient. The default value is C=1, but you can make it larger, like C=1000000, which means almost no regularization. In this case, the output is almost identical to that of Stata:

mul_lr2 = linear_model.LogisticRegression(
    multi_class='multinomial', solver='newton-cg', C=1000000
).fit(train_x, train_y)
print(mul_lr2.intercept_ - mul_lr2.intercept_[-1])
print((mul_lr2.coef_  - mul_lr2.coef_[-1])[:, reorder])

which gives you

[9.54412644 5.04126452 0.        ]
[[ 0.61978951 -0.51320481 -0.04053013 -0.0459711  -0.09904948]
 [-0.33508605 -0.86869799 -0.02262518 -0.01161839 -0.05913068]
 [ 0.          0.          0.          0.          0.        ]]

The results still differ a little (around the fifth decimal place), but with even less regularization the difference will shrink further.

Answered Oct 01 '22 by David Dale