Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

glm in python vs R

I have some dataset: titanic

Doing this in R

glm(Survived ~ Sex, titanic, family = "binomial")

I get

(Intercept)     SexMale 
1.124321   -2.477825

R takes survived as positive outcome.

But when I'm doing the same in Python

sm.formula.glm("Survived ~ Sex", family=sm.families.Binomial(), data=titanic).fit()

I get negative results: i.e. Python takes not survived as positive outcome.

How can I adjust Python's glm function behavior so it will return the same result as R does?

like image 878
user2528473 Avatar asked Nov 29 '17 11:11

user2528473


1 Answers

You just need to set your reference group to either male or female (depending on what you're interested in):

With a small test dataset in R, the code and model summary looks like this:

df <- data.frame(c(0,0,1,1,0), c("Male", "Female", "Female", "Male", "Male"))
colnames(df) <- c("Survived", "Sex")

model <- glm(Survived ~ Sex, data=df, family="binomial")
summary(model)

Output:

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.084e-16  1.414e+00   0.000    1.000
SexMale     -6.931e-01  1.871e+00  -0.371    0.711

To get something similar in Python/statsmodels:

import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({"Survived": [0,0,1,1,0],
                   "Sex": ["Male", "Female", "Female", "Male", "Male"]})

model = sm.formula.glm("Survived ~ C(Sex, Treatment(reference='Female'))",
                       family=sm.families.Binomial(), data=df).fit()
print(model.summary())

Which will give:

                                                    coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------------------
Intercept                                      5.551e-16      1.414   3.93e-16      1.000      -2.772       2.772
C(Sex, Treatment(reference='Female'))[T.Male]    -0.6931      1.871     -0.371      0.711      -4.360       2.974

Notice the use of Treatment() to set the reference group. I've set it to Female in this case to match the R output, but with your dataset it might make more sense to use Male. Either way, its just an issue of being explicit about which group is used as reference.

like image 111
Simon Avatar answered Nov 01 '22 06:11

Simon