Cross posted on CrossValidated.
A while ago, I asked this question, which was about correcting the standard errors when using IV/2SLS and the first stage has a tobit distribution, on which I got an amazing answer from jay.sf (example data at the bottom). He provided me with the following function:
vcov2sls <- function(s1, s2, data, type=2) {
## get y names
y1.nm <- gsub(".*=\\s(.*)(?=\\s~).*", "\\1", deparse(s1$call)[1], perl=TRUE)
y2.nm <- as.character(s2$terms)[2]
## auxilliary model matrix
X <- cbind(`(Intercept)`=1, data[, y1.nm, F], model.matrix(s2)[,-(1:2)])
## get y
y <- data[, y2.nm]
## betas second stage
b <- s2$coefficients
## calculate corrected sums of squares
sse <- sum((y - b %*% t(X))^2)
rmse <- sqrt(mean(s2$residuals^2)) ## RMSE 2nd stage
V0 <- vcov(s2) ## biased vcov 2nd stage
dof <- s2$df.residual ## degrees of freedom 2nd stage
## calculate corrected RMSE
rmse.c <- sqrt(sse/dof)
## calculate corrected vcov
V <- (rmse.c/rmse)^2 * V0
return(V)
}
So what I am looking for, is a function in which I can provide both the vcov matrix (the vcov2sls
), and have robust and clustered standard errors. However it seems that they both pertain to the same vcov matrix option. So if I supply one, I already have to make sure the se's are clustered and robust.. So I guess I am essentially asking how I can make the vcov2sls
function have robust and clustered errors. Obviously any other solution leading to the same practical outcome would be great as well.
However I want to use jay.sf's function, when the first stage is an lm
, the clustering takes part in the summary (source), for example:
first_stage_ols <- lm(y~x, data=DT)
summary(first_stage_ols, robust=T)
Is there either, a way to correct the standard errors from within the lm function , or (replaced them in the result), or adapt the vcov2sls
function to also account for robust/clustered standard errors?
EDIT: I know that also lmtest:coeftest
exists, but I want to able to use weights
. See this link. I am having trouble figuring out if this is possible in lmtest:coeftest
.
So I tried testers answer in two situations. Firstly where I move from a tobit to a lm, and the other vice versa.
# Tobit -> LM
library(lmtest)
library(sandwich)
## run with lm ##
s1.tobit <- AER::tobit(taxrate ~ votewon + industry + size + urbanisation + vote, data=DF)
# cluster and adjust ses
s1.robust <- vcovCL(s1.tobit, cluster = ~ industry)
s1.robust.se <- sqrt(diag(s1.robust))
s1.summary <- summary(s1.tobit)
s1.summary$coefficients[, 2] <- s1.robust.se
yhat <- fitted(s1.tobit)
s2.lm <- lm(sales ~ yhat + industry + size + urbanisation + vote, data=DF)
lmtest::coeftest(s2.lm, vcov.=vcov2sls(s1.summary, s2.lm, DF))
# WORKS!
Vice versa:
# LM -> tobit
library(lmtest)
library(sandwich)
## run with lm ##
s1.lm <- lm(taxrate ~ votewon + industry + size + urbanisation + vote, data=DF)
# cluster and adjust ses
s1.robust <- vcovCL(s1.lm, cluster = ~ industry)
s1.robust.se <- sqrt(diag(s1.robust))
s1.summary <- summary(s1.lm)
s1.summary$coefficients[, 2] <- s1.robust.se
yhat <- fitted(s1.lm)
s2.tobit <- AER::tobit(sales ~ yhat + industry + size + urbanisation + vote, data=DF)
and then ????
# DOES NOT WORK, NO WAY TO ADD THE VCOV TO TOBIT
END OF EDIT
When using lm_robust
the result of the first stage is as follows:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) 25.3890287 2.1726518 11.6857327 0.009184928 15.393996 35.3840612 1.870781
votewon -0.9900966 2.1099738 -0.4692459 0.687605609 -10.636404 8.6562105 1.882014
industryE -0.7564888 0.3710393 -2.0388372 0.184868777 -2.434709 0.9217314 1.901678
industryF -2.6639323 0.3058024 -8.7112866 0.013649538 -4.002800 -1.3250647 1.964416
size -0.5291956 0.5523497 -0.9580807 0.443894805 -3.036862 1.9784705 1.894753
urbanisationB -1.5851495 2.2454251 -0.7059463 0.554845739 -11.464414 8.2941148 1.954657
urbanisationC -2.7234541 0.3573827 -7.6205532 0.020124544 -4.365749 -1.0811587 1.872744
vote 3.1749142 2.4600297 1.2906000 0.341874112 -9.076839 15.4266675 1.740353
However when doing the manual calculations the p-values are very different:
s1.summary
Call:
lm(formula = taxrate ~ votewon + industry + size + urbanisation +
vote, data = DF)
Residuals:
Min 1Q Median 3Q Max
-11.2747 -4.3212 -0.6788 4.3677 10.7369
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.3890 2.1506 13.845 <2e-16 ***
votewon -0.9901 2.1742 -0.676 0.5007
industryE -0.7565 0.3492 -0.557 0.5792
industryF -2.6639 0.2877 -1.855 0.0668 .
size -0.5292 0.5109 -1.250 0.2145
urbanisationB -1.5851 2.2311 -1.098 0.2753
urbanisationC -2.7235 0.3474 -1.704 0.0918 .
vote 3.1749 2.4840 2.105 0.0380 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.623 on 92 degrees of freedom
Multiple R-squared: 0.1054, Adjusted R-squared: 0.03734
F-statistic: 1.549 on 7 and 92 DF, p-value: 0.1609
And this is only for the first stage.
DF <- structure(list(country = c("C", "C", "C", "C", "J", "J", "B",
"B", "F", "F", "E", "E", "D", "D", "F", "F", "I", "I", "J", "J",
"E", "E", "C", "C", "I", "I", "I", "I", "I", "I", "C", "C", "H",
"H", "J", "J", "G", "G", "J", "J", "I", "I", "C", "C", "D", "D",
"A", "A", "G", "G", "E", "E", "J", "J", "G", "G", "I", "I", "I",
"I", "J", "J", "G", "G", "E", "E", "G", "G", "E", "E", "F", "F",
"I", "I", "B", "B", "E", "E", "H", "H", "B", "B", "A", "A", "I",
"I", "I", "I", "F", "F", "E", "E", "I", "I", "J", "J", "D", "D",
"F", "F"), year = c(2005, 2010, 2010, 2005, 2005, 2010, 2010,
2005, 2010, 2005, 2005, 2010, 2010, 2005, 2005, 2010, 2005, 2010,
2005, 2010, 2010, 2005, 2010, 2005, 2005, 2010, 2005, 2010, 2010,
2005, 2010, 2005, 2005, 2010, 2010, 2005, 2005, 2010, 2005, 2010,
2005, 2010, 2005, 2010, 2010, 2005, 2005, 2010, 2010, 2005, 2010,
2005, 2010, 2005, 2010, 2005, 2010, 2005, 2010, 2005, 2010, 2005,
2010, 2005, 2010, 2005, 2010, 2005, 2005, 2010, 2005, 2010, 2005,
2010, 2005, 2010, 2005, 2010, 2005, 2010, 2010, 2005, 2005, 2010,
2005, 2010, 2010, 2005, 2010, 2005, 2010, 2005, 2005, 2010, 2005,
2010, 2010, 2005, 2010, 2005), sales = c(15.48, 12.39, 3.72,
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9, 15.48, 12.39, 3.72,
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9, 15.48, 12.39, 3.72,
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9, 15.48, 12.39, 3.72,
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9, 15.48, 12.39, 3.72,
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9, 15.48, 12.39, 3.72,
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9, 15.48, 12.39, 3.72,
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9, 15.48, 12.39, 3.72,
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9, 15.48, 12.39, 3.72,
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9, 15.48, 12.39, 3.72,
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9), industry = c("D",
"D", "E", "E", "F", "F", "F", "F", "D", "D", "E", "E", "D", "D",
"E", "E", "F", "F", "F", "F", "D", "D", "F", "F", "E", "E", "D",
"D", "D", "D", "E", "E", "F", "F", "D", "D", "E", "E", "E", "E",
"D", "D", "E", "E", "D", "D", "D", "D", "E", "E", "D", "D", "F",
"F", "D", "D", "D", "D", "E", "E", "D", "D", "E", "E", "D", "D",
"D", "D", "D", "D", "F", "F", "F", "F", "E", "E", "D", "D", "E",
"E", "F", "F", "E", "E", "F", "F", "E", "E", "F", "F", "D", "D",
"D", "D", "D", "D", "D", "D", "F", "F"), urbanisation = c("B",
"B", "A", "A", "B", "B", "A", "A", "C", "C", "C", "C", "A", "A",
"B", "B", "C", "C", "A", "A", "C", "C", "B", "B", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "C", "C", "B", "B", "B", "B",
"B", "B", "C", "C", "A", "A", "B", "B", "B", "B", "A", "A", "B",
"B", "A", "A", "A", "A", "B", "B", "C", "C", "A", "A", "C", "C",
"A", "A", "B", "B", "A", "A", "B", "B", "B", "B", "B", "B", "C",
"C", "A", "A", "A", "A", "A", "A", "A", "A", "C", "C", "A", "A",
"B", "B", "A", "A", "B", "B", "B", "B"), size = c(1, 1, 5, 5,
5, 5, 1, 1, 1, 1, 5, 5, 5, 5, 2, 2, 2, 2, 5, 5, 1, 1, 1, 1, 5,
5, 5, 5, 4, 4, 5, 5, 5, 5, 4, 4, 2, 2, 5, 5, 1, 1, 1, 1, 2, 2,
1, 1, 2, 2, 5, 5, 1, 1, 3, 3, 2, 2, 2, 2, 5, 5, 4, 4, 1, 1, 5,
5, 2, 2, 5, 5, 2, 2, 2, 2, 4, 4, 3, 3, 4, 4, 3, 3, 3, 3, 3, 3,
5, 5, 3, 3, 2, 2, 3, 3, 1, 1, 5, 5), base_rate = c(14L, 14L,
14L, 14L, 19L, 19L, 30L, 30L, 20L, 20L, 29L, 29L, 20L, 20L, 20L,
20L, 24L, 24L, 19L, 19L, 29L, 29L, 14L, 14L, 24L, 24L, 24L, 24L,
24L, 24L, 14L, 14L, 17L, 17L, 19L, 19L, 33L, 33L, 19L, 19L, 24L,
24L, 14L, 14L, 20L, 20L, 23L, 23L, 33L, 33L, 29L, 29L, 19L, 19L,
33L, 33L, 24L, 24L, 24L, 24L, 19L, 19L, 33L, 33L, 29L, 29L, 33L,
33L, 29L, 29L, 20L, 20L, 24L, 24L, 30L, 30L, 29L, 29L, 17L, 17L,
30L, 30L, 23L, 23L, 24L, 24L, 24L, 24L, 20L, 20L, 29L, 29L, 24L,
24L, 19L, 19L, 20L, 20L, 20L, 20L), taxrate = c(12L, 14L, 14L,
12L, 21L, 18L, 30L, 30L, 20L, 20L, 29L, 30L, 20L, 20L, 20L, 20L,
24L, 24L, 21L, 18L, 30L, 29L, 14L, 12L, 24L, 24L, 24L, 24L, 24L,
24L, 14L, 12L, 18L, 19L, 18L, 21L, 33L, 32L, 21L, 18L, 24L, 24L,
12L, 14L, 20L, 20L, 22L, 25L, 32L, 33L, 30L, 29L, 18L, 21L, 32L,
33L, 24L, 24L, 24L, 24L, 18L, 21L, 32L, 33L, 30L, 29L, 32L, 33L,
29L, 30L, 20L, 20L, 24L, 24L, 30L, 30L, 29L, 30L, 18L, 19L, 30L,
30L, 22L, 25L, 24L, 24L, 24L, 24L, 20L, 20L, 30L, 29L, 24L, 24L,
21L, 18L, 20L, 20L, 20L, 20L), vote = c(0, 0, 0, 0, 1, 1, 1,
0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1,
1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1,
1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0,
1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0,
1, 0, 1, 1, 1, 1, 0, 1, 1), votewon = c(0, 0, 0, 0, 1, 0, 1,
0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1,
1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,
1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0,
0, 0, 1, 1, 0, 1, 0, 1, 1)), class = "data.frame", row.names = c(NA,
-100L))
## convert variables to factors beforehand
DF[c(1, 2, 4, 5, 6, 9, 10)] <- lapply(DF[c(1, 2, 4, 5, 6, 9, 10)], factor)
s1.tobit <- AER::tobit(taxrate ~ votewon + industry + size + urbanisation + vote,
left=12, right=33, data=DF)
yhat <- fitted(s1.tobit)
s2.tobit <- lm(sales ~ yhat + industry + size + urbanisation + vote, data=DF)
lmtest::coeftest(s2.tobit, vcov.=vcov2sls(s1.tobit, s2.tobit, DF))
EDIT:
This way you can run a lm in the first stage, adjust the model's SEs and use it to overwrite the SEs produced by summary(lm)
. Then, you can estimate the second stage and use your custom function with coeftest()
. Not sure about the clustering but this should roughly work, no?
library(lmtest)
library(sandwich)
## run with lm ##
s1.lm <- lm(taxrate ~ votewon + industry + size + urbanisation + vote,
data=DF)
# cluster and adjust ses
s1.robust <- vcovCL(s1.lm, cluster = ~ industry)
s1.robust.se <- sqrt(diag(s1.robust))
s1.summary <- summary(s1.lm)
s1.summary$coefficients[, 2] <- s1.robust.se
yhat <- fitted(s1.lm)
s2.lm <- lm(sales ~ yhat + industry + size + urbanisation + vote, data=DF)
lmtest::coeftest(s2.lm, vcov.=vcov2sls(s1.summary, s2.lm, DF))
Take a look at the estimatr
package, especially lm_robust
. I'm not 100% sure what you intend to do, but you can get robust standard errors by running this:
library(estimatr)
lm1 <-
lm_robust(
mpg ~ hp,
data = mtcars,
clusters = cyl,
weights = wt,
se_type = "stata" # alternatives: CR0, CR1
)
summary(lm1)
Using your example from above, is this roughly what you are looking for? Please note, that lm_robust already adjusts the SEs.
s1.lm <- lm_robust(taxrate ~ votewon + industry + size + urbanisation + vote,
data = DF)
yhat <- fitted(s1.lm)
s2.lm <- lm(sales ~ yhat + industry + size + urbanisation + vote, data = DF)
> lmtest::coeftest(s2.lm, vcov. = vcov2sls(s1.lm, s2.lm, DF))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -18.45116 62.14257 -0.2969 0.7672
yhat 1.57784 2.72176 0.5797 0.5636
industryE 0.98174 5.10677 0.1922 0.8480
industryF 2.09036 7.25181 0.2883 0.7738
size2 -8.85327 12.43454 -0.7120 0.4783
size3 -5.74011 7.14973 -0.8028 0.4242
size4 -10.79326 13.14534 -0.8211 0.4138
size5 -3.38280 5.45691 -0.6199 0.5369
urbanisationB -1.74588 6.34107 -0.2753 0.7837
urbanisationC -2.00370 6.48533 -0.3090 0.7581
vote1 -1.01661 6.49424 -0.1565 0.8760
> lmtest::coeftest(s2.lm)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -18.45116 46.39576 -0.3977 0.6918
yhat 1.57784 2.03207 0.7765 0.4395
industryE 0.98174 3.81273 0.2575 0.7974
industryF 2.09036 5.41421 0.3861 0.7004
size2 -8.85327 9.28365 -0.9536 0.3428
size3 -5.74011 5.33801 -1.0753 0.2851
size4 -10.79326 9.81434 -1.0997 0.2744
size5 -3.38280 4.07414 -0.8303 0.4086
urbanisationB -1.74588 4.73425 -0.3688 0.7132
urbanisationC -2.00370 4.84196 -0.4138 0.6800
vote1 -1.01661 4.84861 -0.2097 0.8344
I'll update my answer depending on your comments.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With