Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Replicating Stata Probit with robust errors in R

I am trying to replicate a Stata Output in R. I am using the dataset affairs. I am having trouble replicating the probit function with robust standard errors.

The Stata code looks like that:

probit affair male age yrsmarr kids relig educ ratemarr, r

I've started with:

 probit1 <- glm(affair ~ male + age + yrsmarr + kids + relig + educ + ratemarr, 
           family = binomial (link = "probit"), data = mydata)

Then I've tried various adjustments with the sandwich package, such as:

myProbit <- function(probit1, vcov = sandwich(..., adjust = TRUE)) {
            print(coeftest(probit1, vcov = sandwich(probit1, adjust = TRUE)))
}

Or (with all types HC0 to HC5):

myProbit <- function(probit1, vcov = sandwich) {
            print(coeftest(probit1, vcovHC(probit1, type = "HC0"))  
}

Or this, as was suggested here (do I have to enter something different for object?):

sandwich1 <- function(object, ...) sandwich(object) * nobs(object) / (nobs(object) - 1)
coeftest(probit1, vcov = sandwich1)

None of these attempts led to the same standard errors or z-values from the stata output.

Hoping for some constructive ideas!

Thanks in advance!

like image 992
Semprini Avatar asked May 14 '15 11:05

Semprini


2 Answers

For folks who are considering jumping on this wagon, here is some code demonstrating the problem (data here):

clear
set more off
capture ssc install bcuse
capture ssc install rsource
bcuse affairs

saveold affairs, version(12) replace

rsource, terminator(XXX)
  library("foreign")
  library("lmtest")
  library("sandwich")
  mydata<-read.dta("affairs.dta")
  probit1<-glm(affair ~ male + age + yrsmarr + kids + relig + educ + ratemarr, family = binomial (link = "probit"), data = mydata)
  sandwich1 <- function(object,...) sandwich(object) * nobs(object)/(nobs(object) - 1)
  coeftest(probit1,vcov = sandwich1)
XXX 

probit affair male age yrsmarr kids relig educ ratemarr, robust cformat(%9.6f) nolog

R gives:

z test of coefficients:

             Estimate Std. Error z value  Pr(>|z|)    
(Intercept)  0.764157   0.546692  1.3978 0.1621780    
male         0.188816   0.133260  1.4169 0.1565119    
age         -0.024400   0.011423 -2.1361 0.0326725 *  
yrsmarr      0.054608   0.019025  2.8703 0.0041014 ** 
kids         0.208072   0.168222  1.2369 0.2161261    
relig       -0.186085   0.053968 -3.4480 0.0005647 ***
educ         0.015506   0.026389  0.5876 0.5568012    
ratemarr    -0.272711   0.053668 -5.0814 3.746e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Stata yields:

Probit regression                               Number of obs     =        601
                                                Wald chi2(7)      =      54.93
                                                Prob > chi2       =     0.0000
Log pseudolikelihood =  -305.2525               Pseudo R2         =     0.0961

------------------------------------------------------------------------------
             |               Robust
      affair |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   0.188817   0.131927     1.43   0.152    -0.069755    0.447390
         age |  -0.024400   0.011124    -2.19   0.028    -0.046202   -0.002597
     yrsmarr |   0.054608   0.018963     2.88   0.004     0.017441    0.091775
        kids |   0.208075   0.166243     1.25   0.211    -0.117754    0.533905
       relig |  -0.186085   0.053240    -3.50   0.000    -0.290435   -0.081736
        educ |   0.015505   0.026355     0.59   0.556    -0.036150    0.067161
    ratemarr |  -0.272710   0.053392    -5.11   0.000    -0.377356   -0.168064
       _cons |   0.764160   0.534335     1.43   0.153    -0.283117    1.811437
------------------------------------------------------------------------------

Addendum:

The difference in covariance estimation of coefficients is due to the different fitting algorithms. In R, the glm command uses iterative least-square method, while Stata's probit uses an ML method based on Newton-Raphson algorithm. You can match what R is doing with glm in Stata with irls option:

glm affair male age yrsmarr kids relig educ ratemarr, irls family(binomial) link(probit) robust

This yields:

Generalized linear models                         No. of obs      =        601
Optimization     : MQL Fisher scoring             Residual df     =        593
                   (IRLS EIM)                     Scale parameter =          1
Deviance         =  610.5049916                   (1/df) Deviance =   1.029519
Pearson          =  619.0405832                   (1/df) Pearson  =   1.043913

Variance function: V(u) = u*(1-u)                 [Bernoulli]
Link function    : g(u) = invnorm(u)              [Probit]

                                                  BIC             =  -3183.862

------------------------------------------------------------------------------
             |             Semirobust
      affair |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   0.188817   0.133260     1.42   0.157    -0.072367    0.450002
         age |  -0.024400   0.011422    -2.14   0.033    -0.046787   -0.002012
     yrsmarr |   0.054608   0.019025     2.87   0.004     0.017319    0.091897
        kids |   0.208075   0.168222     1.24   0.216    -0.121634    0.537785
       relig |  -0.186085   0.053968    -3.45   0.001    -0.291862   -0.080309
        educ |   0.015505   0.026389     0.59   0.557    -0.036216    0.067226
    ratemarr |  -0.272710   0.053668    -5.08   0.000    -0.377898   -0.167522
       _cons |   0.764160   0.546693     1.40   0.162    -0.307338    1.835657
------------------------------------------------------------------------------

These will be close, though not identical. I am not sure how to get R to use something like NR without a whole lot of work.

like image 136
dimitriy Avatar answered Sep 20 '22 06:09

dimitriy


I am using matrix approach as described in detail here (p.57) to match the R results with Stata. However, I couldn't match the result exactly yet. I think the small difference might be due to difference in scores. Scores in R match with Stata only up to 4 decimal places .

Stata

clear all
bcuse affairs

probit affair male age yrsmarr kids relig educ ratemarr
mat var_nr=e(V)
predict double u, score
matrix accum s = male age yrsmarr kids relig educ ratemarr [iweight=u^2*601/600] //n=601,n-1=600
matrix rv = var_nr*s*var_nr
mat diagrv=vecdiag(rv)
matmap diagrv rse,m(sqrt(@)) //install matmap 
mat list rse //standard errors

This gives you the same standard errors as:

qui probit affair male age yrsmarr kids relig educ ratemarr,r



rse[1,8]
       affair:    affair:    affair:    affair:    affair:    affair:    affair:    affair:
         male        age    yrsmarr       kids      relig       educ   ratemarr      _cons
r1  .13192707  .01112372  .01896336  .16624258  .05324046  .02635524  .05339163  .53433495

R:

library(AER) # Affairs data
data(Affairs)
mydata<-Affairs
mydata$affairs<-with(mydata,ifelse(affairs>0,1,affairs)) # convert to 1 and 0 
probit1<-glm(affairs ~ gender+ age + yearsmarried + children + religiousness+education + rating,family = binomial(link = "probit"),data = mydata)
u<-subset(estfun(probit1),select="(Intercept)") #scores: perfectly matches to 4 decimals with Stata: difference may be due to this step
w0<-u%*%t(u)*(601/600) #(n/n-1)
iweight<-matrix(0,nrow=601,ncol=601) #perfectly matches to 4 decimals with Stata 
diag(iweight)<-diag(w0) 
x<-model.matrix(probit1)  
s<-t(x)%*%iweight%*%x #doesn't match with Stata : 
rv<-vcov(probit1)%*%s%*%vcov(probit1)
rse<-sqrt(diag(rv)) # standard  errors
   rse
  (Intercept)    gendermale           age  yearsmarried   childrenyes religiousness     education        rating 
   0.54669177    0.13325951    0.01142258    0.01902537    0.16822161    0.05396841    0.02638902    0.05366828 

This matches with:

 sandwich1 <- function(object, ...) sandwich(object) * nobs(object) / (nobs(object) - 1)
coeftest(probit1, vcov = sandwich1) 

Conclusion: The difference in results between R and Stata is due to difference in scores (matches only up to 4 decimal places).

like image 22
Metrics Avatar answered Sep 21 '22 06:09

Metrics