glm in python vs R

Question

I have some dataset: titanic

Doing this in R

glm(Survived ~ Sex, titanic, family = "binomial")

I get

(Intercept)     SexMale 
1.124321   -2.477825

R takes survived as positive outcome.

But when I'm doing the same in Python

sm.formula.glm("Survived ~ Sex", family=sm.families.Binomial(), data=titanic).fit()

I get negative results: i.e. Python takes not survived as positive outcome.

How can I adjust Python's glm function behavior so it will return the same result as R does?

Simon · Accepted Answer

You just need to set your reference group to either male or female (depending on what you're interested in):

With a small test dataset in R, the code and model summary looks like this:

df <- data.frame(c(0,0,1,1,0), c("Male", "Female", "Female", "Male", "Male"))
colnames(df) <- c("Survived", "Sex")

model <- glm(Survived ~ Sex, data=df, family="binomial")
summary(model)

Output:

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.084e-16  1.414e+00   0.000    1.000
SexMale     -6.931e-01  1.871e+00  -0.371    0.711

To get something similar in Python/statsmodels:

import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({"Survived": [0,0,1,1,0],
                   "Sex": ["Male", "Female", "Female", "Male", "Male"]})

model = sm.formula.glm("Survived ~ C(Sex, Treatment(reference='Female'))",
                       family=sm.families.Binomial(), data=df).fit()
print(model.summary())

Which will give:

                                                    coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------------------
Intercept                                      5.551e-16      1.414   3.93e-16      1.000      -2.772       2.772
C(Sex, Treatment(reference='Female'))[T.Male]    -0.6931      1.871     -0.371      0.711      -4.360       2.974

Notice the use of Treatment() to set the reference group. I've set it to Female in this case to match the R output, but with your dataset it might make more sense to use Male. Either way, its just an issue of being explicit about which group is used as reference.

glm in python vs R

Tags:

python

r

statistics

statsmodels

regression

user2528473

1 Answers

Simon

Recent Activity

Donate For Us

glm in python vs R

Tags:

python

r

statistics

statsmodels

regression

user2528473

1 Answers

Simon

Related questions

Recent Activity

Donate For Us