I am trying to replicate a Stata Output in R. I am using the dataset affairs. I am having trouble replicating the probit function with robust standard errors.
The Stata code looks like that:
probit affair male age yrsmarr kids relig educ ratemarr, r
I've started with:
probit1 <- glm(affair ~ male + age + yrsmarr + kids + relig + educ + ratemarr,
family = binomial (link = "probit"), data = mydata)
Then I've tried various adjustments with the sandwich
package, such as:
myProbit <- function(probit1, vcov = sandwich(..., adjust = TRUE)) {
print(coeftest(probit1, vcov = sandwich(probit1, adjust = TRUE)))
}
Or (with all types HC0
to HC5
):
myProbit <- function(probit1, vcov = sandwich) {
print(coeftest(probit1, vcovHC(probit1, type = "HC0"))
}
Or this, as was suggested here (do I have to enter something different for object
?):
sandwich1 <- function(object, ...) sandwich(object) * nobs(object) / (nobs(object) - 1)
coeftest(probit1, vcov = sandwich1)
None of these attempts led to the same standard errors or z-values from the stata output.
Hoping for some constructive ideas!
Thanks in advance!
For folks who are considering jumping on this wagon, here is some code demonstrating the problem (data here):
clear
set more off
capture ssc install bcuse
capture ssc install rsource
bcuse affairs
saveold affairs, version(12) replace
rsource, terminator(XXX)
library("foreign")
library("lmtest")
library("sandwich")
mydata<-read.dta("affairs.dta")
probit1<-glm(affair ~ male + age + yrsmarr + kids + relig + educ + ratemarr, family = binomial (link = "probit"), data = mydata)
sandwich1 <- function(object,...) sandwich(object) * nobs(object)/(nobs(object) - 1)
coeftest(probit1,vcov = sandwich1)
XXX
probit affair male age yrsmarr kids relig educ ratemarr, robust cformat(%9.6f) nolog
R gives:
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.764157 0.546692 1.3978 0.1621780
male 0.188816 0.133260 1.4169 0.1565119
age -0.024400 0.011423 -2.1361 0.0326725 *
yrsmarr 0.054608 0.019025 2.8703 0.0041014 **
kids 0.208072 0.168222 1.2369 0.2161261
relig -0.186085 0.053968 -3.4480 0.0005647 ***
educ 0.015506 0.026389 0.5876 0.5568012
ratemarr -0.272711 0.053668 -5.0814 3.746e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Stata yields:
Probit regression Number of obs = 601
Wald chi2(7) = 54.93
Prob > chi2 = 0.0000
Log pseudolikelihood = -305.2525 Pseudo R2 = 0.0961
------------------------------------------------------------------------------
| Robust
affair | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
male | 0.188817 0.131927 1.43 0.152 -0.069755 0.447390
age | -0.024400 0.011124 -2.19 0.028 -0.046202 -0.002597
yrsmarr | 0.054608 0.018963 2.88 0.004 0.017441 0.091775
kids | 0.208075 0.166243 1.25 0.211 -0.117754 0.533905
relig | -0.186085 0.053240 -3.50 0.000 -0.290435 -0.081736
educ | 0.015505 0.026355 0.59 0.556 -0.036150 0.067161
ratemarr | -0.272710 0.053392 -5.11 0.000 -0.377356 -0.168064
_cons | 0.764160 0.534335 1.43 0.153 -0.283117 1.811437
------------------------------------------------------------------------------
Addendum:
The difference in covariance estimation of coefficients is due to the
different fitting algorithms. In R, the glm
command uses iterative least-square method, while Stata's probit
uses an ML method based on Newton-Raphson algorithm. You can match what R is doing with glm
in Stata with irls
option:
glm affair male age yrsmarr kids relig educ ratemarr, irls family(binomial) link(probit) robust
This yields:
Generalized linear models No. of obs = 601
Optimization : MQL Fisher scoring Residual df = 593
(IRLS EIM) Scale parameter = 1
Deviance = 610.5049916 (1/df) Deviance = 1.029519
Pearson = 619.0405832 (1/df) Pearson = 1.043913
Variance function: V(u) = u*(1-u) [Bernoulli]
Link function : g(u) = invnorm(u) [Probit]
BIC = -3183.862
------------------------------------------------------------------------------
| Semirobust
affair | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
male | 0.188817 0.133260 1.42 0.157 -0.072367 0.450002
age | -0.024400 0.011422 -2.14 0.033 -0.046787 -0.002012
yrsmarr | 0.054608 0.019025 2.87 0.004 0.017319 0.091897
kids | 0.208075 0.168222 1.24 0.216 -0.121634 0.537785
relig | -0.186085 0.053968 -3.45 0.001 -0.291862 -0.080309
educ | 0.015505 0.026389 0.59 0.557 -0.036216 0.067226
ratemarr | -0.272710 0.053668 -5.08 0.000 -0.377898 -0.167522
_cons | 0.764160 0.546693 1.40 0.162 -0.307338 1.835657
------------------------------------------------------------------------------
These will be close, though not identical. I am not sure how to get R to use something like NR without a whole lot of work.
I am using matrix approach as described in detail here (p.57) to match the R results with Stata. However, I couldn't match the result exactly yet. I think the small difference might be due to difference in scores. Scores in R
match with Stata
only up to 4 decimal places .
Stata
clear all
bcuse affairs
probit affair male age yrsmarr kids relig educ ratemarr
mat var_nr=e(V)
predict double u, score
matrix accum s = male age yrsmarr kids relig educ ratemarr [iweight=u^2*601/600] //n=601,n-1=600
matrix rv = var_nr*s*var_nr
mat diagrv=vecdiag(rv)
matmap diagrv rse,m(sqrt(@)) //install matmap
mat list rse //standard errors
This gives you the same standard errors as:
qui probit affair male age yrsmarr kids relig educ ratemarr,r
rse[1,8]
affair: affair: affair: affair: affair: affair: affair: affair:
male age yrsmarr kids relig educ ratemarr _cons
r1 .13192707 .01112372 .01896336 .16624258 .05324046 .02635524 .05339163 .53433495
R:
library(AER) # Affairs data
data(Affairs)
mydata<-Affairs
mydata$affairs<-with(mydata,ifelse(affairs>0,1,affairs)) # convert to 1 and 0
probit1<-glm(affairs ~ gender+ age + yearsmarried + children + religiousness+education + rating,family = binomial(link = "probit"),data = mydata)
u<-subset(estfun(probit1),select="(Intercept)") #scores: perfectly matches to 4 decimals with Stata: difference may be due to this step
w0<-u%*%t(u)*(601/600) #(n/n-1)
iweight<-matrix(0,nrow=601,ncol=601) #perfectly matches to 4 decimals with Stata
diag(iweight)<-diag(w0)
x<-model.matrix(probit1)
s<-t(x)%*%iweight%*%x #doesn't match with Stata :
rv<-vcov(probit1)%*%s%*%vcov(probit1)
rse<-sqrt(diag(rv)) # standard errors
rse
(Intercept) gendermale age yearsmarried childrenyes religiousness education rating
0.54669177 0.13325951 0.01142258 0.01902537 0.16822161 0.05396841 0.02638902 0.05366828
This matches with:
sandwich1 <- function(object, ...) sandwich(object) * nobs(object) / (nobs(object) - 1)
coeftest(probit1, vcov = sandwich1)
Conclusion: The difference in results between R and Stata is due to difference in scores (matches only up to 4 decimal places).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With