Two stage least square in R

Tags:

I want to run a two stage probit least square regression in R. Does anyone know how to do this? Is there any package out there? I know it's possible to do it using Stata, so I imagine it's possible to do it with R.

496

asked Oct 02 '10 00:10

Manoel Galdino

3 Answers

You might want to be more specific when you say 'two-stage-probit-least-squares'. Since you refer to a Stata program that implements this I am guessing you are talking about the CDSIMEQ package, which implements the Amemiya (1978) procedure for the Heckit model (a.k.a Generalized Tobit, a.k.a. Tobit type II model, etc.). As Grant said, systemfit will do a Tobit for you, but not with two equations. The MicEcon package did have a Heckit (but the package has split so many times I don't know where it is now).

If you want what the CDSIMEQ does, it can easily be implemented in R. I wrote a function that replicates CDSIMEQ:

tspls <- function(formula1, formula2, data) {
    # The Continous model
    mf1 <- model.frame(formula1, data)
    y1 <- model.response(mf1)
    x1 <- model.matrix(attr(mf1, "terms"), mf1)

    # The dicontionous model
    mf2 <- model.frame(formula2, data)
    y2 <- model.response(mf2)
    x2 <- model.matrix(attr(mf2, "terms"), mf2)

    # The matrix of all the exogenous variables
    X <- cbind(x1, x2)
    X <- X[, unique(colnames(X))]

    J1 <- matrix(0, nrow = ncol(X), ncol = ncol(x1))
    J2 <- matrix(0, nrow = ncol(X), ncol = ncol(x2))
    for (i in 1:ncol(x1)) J1[match(colnames(x1)[i], colnames(X)), i] <- 1
    for (i in 1:ncol(x2)) J2[match(colnames(x2)[i], colnames(X)), i] <- 1

    # Step 1:
    cat("\n\tNOW THE FIRST STAGE REGRESSION")
    m1 <- lm(y1 ~ X - 1)
    m2 <- glm(y2 ~ X - 1, family = binomial(link = "probit"))
    print(summary(m1))
    print(summary(m2))

    yhat1 <- m1$fitted.values
    yhat2 <- X %*% coef(m2)

    PI1 <- m1$coefficients
    PI2 <- m2$coefficients
    V0 <- vcov(m2)
    sigma1sq <- sum(m1$residuals ^ 2) / m1$df.residual
    sigma12 <- 1 / length(y2) * sum(y2 * m1$residuals / dnorm(yhat2))

    # Step 2:
    cat("\n\tNOW THE SECOND STAGE REGRESSION WITH INSTRUMENTS")

    m1 <- lm(y1 ~ yhat2 + x1 - 1)
    m2 <- glm(y2 ~ yhat1 + x2 - 1, family = binomial(link = "probit"))
    sm1 <- summary(m1)
    sm2 <- summary(m2)
    print(sm1)
    print(sm2)

    # Step  3:
    cat("\tNOW THE SECOND STAGE REGRESSION WITH CORRECTED STANDARD ERRORS\n\n")
    gamma1 <- m1$coefficients[1]
    gamma2 <- m2$coefficients[1]

    cc <- sigma1sq - 2 * gamma1 * sigma12
    dd <- gamma2 ^ 2 * sigma1sq - 2 * gamma2 * sigma12
    H <- cbind(PI2, J1)
    G <- cbind(PI1, J2)

    XX <- crossprod(X)                          # X'X
    HXXH <- solve(t(H) %*% XX %*% H)            # (H'X'XH)^(-1)
    HXXVXXH <- t(H) %*% XX %*% V0 %*% XX %*% H  # H'X'V0X'XH
    Valpha1 <- cc * HXXH + gamma1 ^ 2 * HXXH %*% HXXVXXH %*% HXXH

    GV <- t(G) %*% solve(V0)    # G'V0^(-1)
    GVG <- solve(GV %*% G)      # (G'V0^(-1)G)^(-1)
    Valpha2 <- GVG + dd * GVG %*% GV %*% solve(XX) %*% solve(V0) %*% G %*% GVG

    ans1 <- coef(sm1)
    ans2 <- coef(sm2)

    ans1[,2] <- sqrt(diag(Valpha1))
    ans2[,2] <- sqrt(diag(Valpha2))
    ans1[,3] <- ans1[,1] / ans1[,2]
    ans2[,3] <- ans2[,1] / ans2[,2]
    ans1[,4] <- 2 * pt(abs(ans1[,3]), m1$df.residual, lower.tail = FALSE)
    ans2[,4] <- 2 * pnorm(abs(ans2[,3]), lower.tail = FALSE)

    cat("Continuous:\n")
    print(ans1)
    cat("Dichotomous:\n")
    print(ans2)
}

For comparison, we can replicate the sample from the author of CDSIMEQ in their article about the package.

> library(foreign)
> cdsimeq <- read.dta("http://www.stata-journal.com/software/sj3-2/st0038/cdsimeq.dta")
> tspls(continuous ~ exog3 + exog2 + exog1 + exog4,
+     dichotomous ~ exog1 + exog2 + exog5 + exog6 + exog7,
+     data = cdsimeq)

        NOW THE FIRST STAGE REGRESSION
Call:
lm(formula = y1 ~ X - 1)

Residuals:
      Min        1Q    Median        3Q       Max 
-1.885921 -0.438579 -0.006262  0.432156  2.133738 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
X(Intercept)  0.010752   0.020620   0.521 0.602187    
Xexog3        0.158469   0.021862   7.249 8.46e-13 ***
Xexog2       -0.009669   0.021666  -0.446 0.655488    
Xexog1        0.159955   0.021260   7.524 1.19e-13 ***
Xexog4        0.316575   0.022456  14.097  < 2e-16 ***
Xexog5        0.497207   0.021356  23.282  < 2e-16 ***
Xexog6       -0.078017   0.021755  -3.586 0.000352 ***
Xexog7        0.161177   0.022103   7.292 6.23e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.6488 on 992 degrees of freedom
Multiple R-squared: 0.5972,     Adjusted R-squared: 0.594 
F-statistic: 183.9 on 8 and 992 DF,  p-value: < 2.2e-16 


Call:
glm(formula = y2 ~ X - 1, family = binomial(link = "probit"))

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.49531  -0.59244   0.01983   0.59708   2.41810  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
X(Intercept)  0.08352    0.05280   1.582 0.113692    
Xexog3        0.21345    0.05678   3.759 0.000170 ***
Xexog2        0.21131    0.05471   3.862 0.000112 ***
Xexog1        0.45591    0.06023   7.570 3.75e-14 ***
Xexog4        0.39031    0.06173   6.322 2.57e-10 ***
Xexog5        0.75955    0.06427  11.818  < 2e-16 ***
Xexog6        0.85461    0.06831  12.510  < 2e-16 ***
Xexog7       -0.16691    0.05653  -2.953 0.003152 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1386.29  on 1000  degrees of freedom
Residual deviance:  754.14  on  992  degrees of freedom
AIC: 770.14

Number of Fisher Scoring iterations: 6


        NOW THE SECOND STAGE REGRESSION WITH INSTRUMENTS
Call:
lm(formula = y1 ~ yhat2 + x1 - 1)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.32152 -0.53160  0.04886  0.53502  2.44818 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
yhat2         0.257592   0.021451  12.009   <2e-16 ***
x1(Intercept) 0.012185   0.024809   0.491    0.623    
x1exog3       0.042520   0.026735   1.590    0.112    
x1exog2       0.011854   0.026723   0.444    0.657    
x1exog1       0.007773   0.028217   0.275    0.783    
x1exog4       0.318636   0.028311  11.255   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.7803 on 994 degrees of freedom
Multiple R-squared: 0.4163,     Adjusted R-squared: 0.4128 
F-statistic: 118.2 on 6 and 994 DF,  p-value: < 2.2e-16 


Call:
glm(formula = y2 ~ yhat1 + x2 - 1, family = binomial(link = "probit"))

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.49610  -0.58595   0.01969   0.59857   2.41281  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
yhat1          1.26287    0.16061   7.863 3.75e-15 ***
x2(Intercept)  0.07080    0.05276   1.342 0.179654    
x2exog1        0.25093    0.06466   3.880 0.000104 ***
x2exog2        0.22604    0.05389   4.194 2.74e-05 ***
x2exog5        0.12912    0.09510   1.358 0.174544    
x2exog6        0.95609    0.07172  13.331  < 2e-16 ***
x2exog7       -0.37128    0.06759  -5.493 3.94e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1386.29  on 1000  degrees of freedom
Residual deviance:  754.21  on  993  degrees of freedom
AIC: 768.21

Number of Fisher Scoring iterations: 6

        NOW THE SECOND STAGE REGRESSION WITH CORRECTED STANDARD ERRORS

Continuous:
                Estimate Std. Error    t value   Pr(>|t|)
yhat2         0.25759209  0.1043073 2.46955009 0.01369540
x1(Intercept) 0.01218500  0.1198713 0.10165068 0.91905445
x1exog3       0.04252006  0.1291588 0.32920764 0.74206810
x1exog2       0.01185438  0.1290754 0.09184073 0.92684309
x1exog1       0.00777347  0.1363643 0.05700519 0.95455252
x1exog4       0.31863627  0.1367881 2.32941597 0.02003661
Dichotomous:
                 Estimate Std. Error    z value     Pr(>|z|)
yhat1          1.26286574  0.7395166  1.7076909 0.0876937093
x2(Intercept)  0.07079775  0.2666447  0.2655134 0.7906139867
x2exog1        0.25092561  0.3126763  0.8025092 0.4222584495
x2exog2        0.22603717  0.2739307  0.8251618 0.4092797527
x2exog5        0.12911922  0.4822986  0.2677163 0.7889176766
x2exog6        0.95609385  0.2823662  3.3860070 0.0007091758
x2exog7       -0.37128221  0.3265478 -1.1369920 0.2555416141

179

answered Oct 18 '22 01:10

eyjo

systemfit will also do the trick.

answered Oct 18 '22 03:10

Grant Cavanaugh

there are several packages available in R to do two state least squares. here are a few

sem: Two-Stage Least Squares
Zelig: Link removed, no longer functional (28.07.11)

let me know if these serve your purpose.

answered Oct 18 '22 02:10

Ramnath

Related questions
                            
                                Non-linear optimisation/programming with integer variables in R
                            
                                How to use submenu in rmarkdown navbar?
                            
                                R ggplot2 - legend at the bottom gets cut, how to find optimal number of columns for the legend on the fly?
                            
                                Different behavior of base R gsub and stringr::str_replace_all?
                            
                                Why does Rccp return a list-like output when I was expecting a dataframe output in R?
                            
                                R: Count frequency of values in nested list with sub-elements
                            
                                Troubleshooting 'Tool(s) not installed or not in PATH: ghostcript' warning in RStudio
                            
                                changing column names of a data frame by changing values - R
                            
                                function to track the changes in a field
                            
                                How to change NA into 0 based on other variable / how many times it was recorded
                            
                                Replace NAs with missing values in sequence (R)
                            
                                Is it possible to have `dput` return source code that would run outside of the enclosing environment?
                            
                                Should "while loops" be preferred to "for loops" for large, necessary loops in R?
                            
                                Draw hyperplane in R?
                            
                                understanding dates/times (POSIXc and POSIXct) in R
                            
                                How to merge two data.frames together in R, referencing a lookup table
                            
                                How to draw a chart with sorted horizontal error bars (sorted barcharts with error marks)?
                            
                                How would you program Pascal's triangle in R?
                            
                                How do I calculate the length of consecutive runs of events, e.g. wins, visits, in R
                            
                                Improvements to the base R graphics

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Two stage least square in R

Tags:

r

regression

stata

least-squares