I have a dataset, titanic.
When I fit this model in R:
glm(Survived ~ Sex, titanic, family = "binomial")
I get
(Intercept) SexMale
1.124321 -2.477825
R treats "survived" as the positive outcome.
But when I do the same in Python:
sm.formula.glm("Survived ~ Sex", family=sm.families.Binomial(), data=titanic).fit()
I get the results with the opposite sign, i.e. Python seems to treat "not survived" as the positive outcome.
How can I adjust the behavior of Python's glm function so that it returns the same result as R?
You just need to set your reference group to either male or female (depending on what you're interested in).
With a small test dataset in R, the code and model summary look like this:
df <- data.frame(c(0,0,1,1,0), c("Male", "Female", "Female", "Male", "Male"))
colnames(df) <- c("Survived", "Sex")
model <- glm(Survived ~ Sex, data=df, family="binomial")
summary(model)
Output:
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.084e-16  1.414e+00   0.000    1.000
SexMale     -6.931e-01  1.871e+00  -0.371    0.711
To get something similar in Python/statsmodels:
import pandas as pd
import statsmodels.api as sm
df = pd.DataFrame({"Survived": [0, 0, 1, 1, 0],
                   "Sex": ["Male", "Female", "Female", "Male", "Male"]})
model = sm.formula.glm("Survived ~ C(Sex, Treatment(reference='Female'))",
                       family=sm.families.Binomial(), data=df).fit()
print(model.summary())
Which will give:
                                                     coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------------------
Intercept                                       5.551e-16      1.414   3.93e-16      1.000      -2.772       2.772
C(Sex, Treatment(reference='Female'))[T.Male]     -0.6931      1.871     -0.371      0.711      -4.360       2.974
Notice the use of Treatment() to set the reference group. I've set it to Female in this case to match the R output, but with your dataset it might make more sense to use Male. Either way, it's just an issue of being explicit about which group is used as the reference.
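To see what switching the reference group does to the numbers, here is a minimal sketch that reproduces the coefficients for the five-row test dataset by hand. It does not call statsmodels. It relies on the fact that, for a model with a single two-level predictor, the fitted intercept equals the log-odds of survival in the reference group and the coefficient equals the difference in log-odds between the two groups. The helper name binary_logit_coefs is made up for this illustration, and the shortcut assumes each group contains both survivors and non-survivors.

```python
import math


def logit(p):
    """Log-odds of a probability p, valid only for 0 < p < 1."""
    return math.log(p / (1 - p))


def binary_logit_coefs(rows, reference):
    """Closed-form logistic fit for one two-level predictor (hypothetical helper).

    rows: list of (survived, sex) pairs; reference: the baseline level.
    Returns (intercept, coefficient of the non-reference level).
    Assumes both outcomes occur in each group, otherwise logit is undefined.
    """
    def survival_rate(in_reference_group):
        ys = [y for y, sex in rows if (sex == reference) == in_reference_group]
        return sum(ys) / len(ys)

    intercept = logit(survival_rate(True))
    coefficient = logit(survival_rate(False)) - intercept
    return intercept, coefficient


# Same five rows as the test dataset above.
rows = list(zip([0, 0, 1, 1, 0], ["Male", "Female", "Female", "Male", "Male"]))

female_ref = binary_logit_coefs(rows, reference="Female")
male_ref = binary_logit_coefs(rows, reference="Male")

print("Female as reference:", female_ref)  # intercept about 0, coefficient about -0.6931
print("Male as reference:  ", male_ref)    # intercept about -0.6931, coefficient about +0.6931
```

With Female as the reference, the values agree with the summaries above up to rounding. With Male as the reference, the coefficient has the same size but the opposite sign, and the intercept moves to the male log-odds. The fitted probabilities are identical in both cases; only the parameterization changes.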