I'm experimenting with negative binomial regression using Python. I found this example using R, along with a data set:
http://www.karlin.mff.cuni.cz/~pesta/NMFM404/NB.html
I tried to replicate the results on the web page using this code:
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
df = pd.read_stata("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/nb_data.dta")
model = smf.glm(formula = "daysabs ~ math + prog", data=df, family=sm.families.NegativeBinomial()).fit()
model.summary()
Unfortunately this didn't give the same coefficients. It gave the following:
coef std err z P>|z| [95.0% Conf. Int.]
Intercept 3.4875 0.236 14.808 0.000 3.026 3.949
math -0.0067 0.003 -2.600 0.009 -0.012 -0.002
prog -0.6781 0.101 -6.683 0.000 -0.877 -0.479
These aren't even close to those on the website. Assuming the R code is correct, what am I doing wrong?
The reason for the discrepancy is that when you are reading in the dataset with Pandas, the prog
variable is treated as type float
by default:
df.prog.head()
0 2.0
1 2.0
2 2.0
3 2.0
4 2.0
Name: prog, dtype: float32
In the R example, on the other hand, the prog
variable was explicitly cast to a factor (categorical) variable:
dat <- within(dat, {
prog <- factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational"))
id <- factor(id)
})
As a result, when you look at the fit summary in R, you can see that the prog
variable has been split into n-1 binary-encoded terms:
> summary(m1 <- glm.nb(daysabs ~ math + prog, data = dat))
Call:
glm.nb(formula = daysabs ~ math + prog, data = dat, init.theta = 1.032713156,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1547 -1.0192 -0.3694 0.2285 2.5273
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.615265 0.197460 13.245 < 2e-16 ***
math -0.005993 0.002505 -2.392 0.0167 *
progAcademic -0.440760 0.182610 -2.414 0.0158 *
progVocational -1.278651 0.200720 -6.370 1.89e-10 ***
Compare this to how the prog
variable appears in the Python fit summary you posted.
To fix the issue, you can use the C()
function to cast the variable to categorical in statsmodels. This way you will arrive at the same result:
model = smf.glm(formula = "daysabs ~ math + C(prog)", data=df, family=sm.families.NegativeBinomial()).fit()
model.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: daysabs No. Observations: 314
Model: GLM Df Residuals: 310
Model Family: NegativeBinomial Df Model: 3
Link Function: log Scale: 1.06830885199
Method: IRLS Log-Likelihood: -865.68
Date: Thu, 16 Feb 2017 Deviance: 350.98
Time: 10:34:16 Pearson chi2: 331.
No. Iterations: 6
==================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------
Intercept 2.6150 0.207 12.630 0.000 2.209 3.021
C(prog)[T.2.0] -0.4408 0.192 -2.302 0.021 -0.816 -0.065
C(prog)[T.3.0] -1.2786 0.210 -6.079 0.000 -1.691 -0.866
math -0.0060 0.003 -2.281 0.023 -0.011 -0.001
==================================================================================
"""
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With