Clustered Standard Errors with data containing NAs

Question

I'm unable to cluster standard errors using R and guidance based on this post. The cl function returns the error:

Error in tapply(x, cluster1, sum) : arguments must have same length

After reading up on tapply I'm still not sure why my cluster argument is the wrong length, and what is causing this error.

Here is a link to the data set that I'm using.

https://www.dropbox.com/s/y2od7um9pp4vn0s/Ec%201820%20-%20DD%20Data%20with%20Controls.csv

Here is the R code:

# read in data
charter<-read.csv(file.choose())
View(charter)
colnames(charter)

# standardize NAEP scores
charter$naep.standardized <- (charter$naep - mean(charter$naep, na.rm=T))/sd(charter$naep, na.rm=T)

# change NAs in year.passed column to 2014
charter$year.passed[is.na(charter$year.passed)]<-2014

# Add column with indicator for in treatment (passed legislation)
charter$treatment<-ifelse(charter$year.passed<=charter$year,1,0)

# fit model
charter.model<-lm(naep ~ factor(year) + factor(state) + treatment, data = charter)
summary(charter.model)
# account for clustered standard errors by state
cl(dat=charter, fm=charter.model, cluster=charter$state)

# accounting for controls
charter.model.controls<-lm(naep~factor)

# clustered standard errors
# ---------

# function that calculates clustered standard errors
# source: http://thetarzan.wordpress.com/2011/06/11/clustered-standard-errors-in-r/
cl   <- function(dat, fm, cluster){
  require(sandwich, quietly = TRUE)
  require(lmtest, quietly = TRUE)
  M <- length(unique(cluster))
  N <- length(cluster)
  K <- fm$rank
  dfc <- (M/(M-1))*((N-1)/(N-K))
  print(K)
  uj  <- apply(estfun(fm),2, function(x) tapply(x, cluster, sum));
  vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N)
  coeftest(fm, vcovCL) 
}

# calculate clustered standard errors 
cl(charter, charter.model, charter$state)

The inner workings of the function are a little over my head.

landroni · Accepted Answer

When you execute your code, notice that there are missing observations in the linear model:

> summary(charter.model)

Call:
lm(formula = naep ~ factor(year) + factor(state) + treatment, 
    data = charter)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.2420  -1.6740  -0.2024   1.8345  12.3580 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 250.4983     1.2115 206.767  < 2e-16 ***
factor(year)1992              3.7970     0.7198   5.275 2.17e-07 ***
factor(year)1996              7.0436     0.8607   8.183 3.64e-15 ***

[..]

Residual standard error: 3.128 on 404 degrees of freedom
  (759 observations deleted due to missingness)
Multiple R-squared:  0.9337,    Adjusted R-squared:  0.9239 
F-statistic: 94.85 on 60 and 404 DF,  p-value: < 2.2e-16

This is what is causing the Error in tapply(x, cluster1, sum) : arguments must have same length error message that you see.

In cl(dat=charter, fm=charter.model, cluster=charter$state) the cluster variable charter$state should have the exact same length as the number of observations effectively used in the regression estimation (which due to NAs is NOT the same as the number of rows in the original data frame).

To work around this you can do the following.

First off you're using an older version of Arai's function (cl) (see Fama-MacBeth and Cluster-Robust (by Firm and Time) Standard Errors in R for references to both the old or the new versions, the latter being called clx).

Second I think Arai's original approach to this function is a bit convoluted, and doesn't really follow the standard interface of vcov* functions from sandwich. That's why I came with a slightly modified version of clx. I made the code a bit more readable, and the interface more like what you would expect from a sandwich vcov* function:

vcovCL <- function(x, cluster.by, type="sss", dfcw=1){
    # R-codes (www.r-project.org) for computing
    # clustered-standard errors. Mahmood Arai, Jan 26, 2008.

    # The arguments of the function are:
    # fitted model, cluster1 and cluster2
    # You need to install libraries `sandwich' and `lmtest'

    # reweighting the var-cov matrix for the within model
    require(sandwich)
    cluster <- cluster.by
    M <- length(unique(cluster))   
    N <- length(cluster)
    stopifnot(N == length(x$residuals))
    K <- x$rank
    ##only Stata small-sample correction supported right now 
    ##see plm >= 1.5-4
    stopifnot(type=="sss")  
    if(type=="sss"){
        dfc <- (M/(M-1))*((N-1)/(N-K))
    }
    uj  <- apply(estfun(x), 2, function(y) tapply(y, cluster, sum))
    mycov <- dfc * sandwich(x, meat=crossprod(uj)/N) * dfcw
    return(mycov)
}

If you try this function on the data you will see that it catches this specific issue:

> coeftest(charter.model, vcov=function(x) vcovCL(x, charter$state))
 Error: N == length(x$residuals) is not TRUE

To avoid the issue you could proceed as follows:

> charter.x <- na.omit(charter[ , c("state", 
                                  all.vars(formula(charter.model)))])
> coeftest(charter.model, vcov=function(x) vcovCL(x, charter.x$state)) 

t test of coefficients:

                               Estimate  Std. Error     t value  Pr(>|t|)    
(Intercept)                  2.5050e+02  9.3781e-01  2.6711e+02 < 2.2e-16 ***
factor(year)1992             3.7970e+00  5.6019e-01  6.7780e+00 4.330e-11 ***
factor(year)1996             7.0436e+00  8.8574e-01  7.9522e+00 1.856e-14 ***
factor(year)2000             8.4313e+00  1.0906e+00  7.7311e+00 8.560e-14 ***
factor(year)2003             1.2392e+01  1.1670e+00  1.0619e+01 < 2.2e-16 ***
factor(year)2005             1.3490e+01  1.1747e+00  1.1484e+01 < 2.2e-16 ***
factor(year)2007             1.6334e+01  1.2469e+00  1.3100e+01 < 2.2e-16 ***
factor(year)2009             1.8118e+01  1.2556e+00  1.4430e+01 < 2.2e-16 ***
factor(year)2011             1.9110e+01  1.3459e+00  1.4199e+01 < 2.2e-16 ***
factor(year)2013             1.9301e+01  1.4896e+00  1.2957e+01 < 2.2e-16 ***
factor(state)Alaska          1.4178e+01  8.7686e-01  1.6169e+01 < 2.2e-16 ***
factor(state)Arizona         8.6313e+00  8.1439e-01  1.0598e+01 < 2.2e-16 ***
factor(state)Arkansas        4.3313e+00  8.1439e-01  5.3185e+00 1.736e-07 ***
factor(state)California      3.1103e+00  9.1619e-01  3.3948e+00 0.0007549 ***
factor(state)Colorado        1.7939e+01  7.9736e-01  2.2498e+01 < 2.2e-16 ***
factor(state)Connecticut     1.8031e+01  8.1439e-01  2.2141e+01 < 2.2e-16 ***
factor(state)D.C.           -1.8369e+01  8.1439e-01 -2.2555e+01 < 2.2e-16 ***
factor(state)Delaware        1.2050e+01  7.9736e-01  1.5113e+01 < 2.2e-16 ***
factor(state)Florida         7.3838e+00  7.9736e-01  9.2602e+00 < 2.2e-16 ***
factor(state)Georgia         6.4313e+00  8.1439e-01  7.8971e+00 2.724e-14 ***
factor(state)Hawaii          3.3313e+00  8.1439e-01  4.0906e+00 5.196e-05 ***
factor(state)Idaho           1.7118e+01  7.8321e-01  2.1857e+01 < 2.2e-16 ***
factor(state)Illinois        1.2670e+01  8.2224e-01  1.5409e+01 < 2.2e-16 ***
factor(state)Indianna        1.7174e+01  6.1079e-01  2.8117e+01 < 2.2e-16 ***
factor(state)Iowa            2.0074e+01  6.8460e-01  2.9322e+01 < 2.2e-16 ***
factor(state)Kansas          2.0123e+01  8.6796e-01  2.3184e+01 < 2.2e-16 ***
factor(state)Kentucky        1.0200e+01  4.1999e-14  2.4287e+14 < 2.2e-16 ***
factor(state)Louisiana      -1.6866e-01  8.1439e-01 -2.0710e-01 0.8360322    
factor(state)Maine           2.0231e+01  1.7564e-01  1.1518e+02 < 2.2e-16 ***
factor(state)Maryland        1.4274e+01  6.1079e-01  2.3369e+01 < 2.2e-16 ***
factor(state)Massachusetts   2.4868e+01  8.3960e-01  2.9619e+01 < 2.2e-16 ***
factor(state)Michigan        1.2031e+01  8.1439e-01  1.4773e+01 < 2.2e-16 ***
factor(state)Minnesota       2.5110e+01  9.1619e-01  2.7407e+01 < 2.2e-16 ***
factor(state)Mississippi    -3.5470e+00  1.7564e-01 -2.0195e+01 < 2.2e-16 ***
factor(state)Missouri        1.3447e+01  7.2706e-01  1.8495e+01 < 2.2e-16 ***
factor(state)Montana         2.2512e+01  8.4814e-01  2.6543e+01 < 2.2e-16 ***
factor(state)Nebraska        1.9600e+01  4.3105e-14  4.5471e+14 < 2.2e-16 ***
factor(state)Nevada          4.9800e+00  8.6796e-01  5.7375e+00 1.887e-08 ***
factor(state)New Hampshire   2.2026e+01  7.6338e-01  2.8853e+01 < 2.2e-16 ***
factor(state)New Jersey      2.0651e+01  7.6338e-01  2.7052e+01 < 2.2e-16 ***
factor(state)New Mexico      1.5313e+00  8.1439e-01  1.8803e+00 0.0607809 .  
factor(state)New York        1.2152e+01  7.1259e-01  1.7054e+01 < 2.2e-16 ***
factor(state)North Carolina  1.2231e+01  8.1439e-01  1.5019e+01 < 2.2e-16 ***
factor(state)North Dakota    2.4278e+01  1.0420e-01  2.3299e+02 < 2.2e-16 ***
factor(state)Ohio            1.7118e+01  7.8321e-01  2.1857e+01 < 2.2e-16 ***
factor(state)Oklahoma        8.4518e+00  7.8321e-01  1.0791e+01 < 2.2e-16 ***
factor(state)Oregon          1.6535e+01  7.3538e-01  2.2486e+01 < 2.2e-16 ***
factor(state)Pennsylvania    1.6651e+01  7.6338e-01  2.1812e+01 < 2.2e-16 ***
factor(state)Rhode Island    9.5313e+00  8.1439e-01  1.1704e+01 < 2.2e-16 ***
factor(state)South Carolina  9.5346e+00  8.3960e-01  1.1356e+01 < 2.2e-16 ***
factor(state)South Dakota    2.1211e+01  3.5103e-01  6.0425e+01 < 2.2e-16 ***
factor(state)Tennessee       4.9148e+00  6.1473e-01  7.9951e+00 1.375e-14 ***
factor(state)Texas           1.4231e+01  8.1439e-01  1.7475e+01 < 2.2e-16 ***
factor(state)Utah            1.5114e+01  7.2706e-01  2.0787e+01 < 2.2e-16 ***
factor(state)Vermont         2.3474e+01  2.0299e-01  1.1564e+02 < 2.2e-16 ***
factor(state)Virginia        1.6252e+01  7.1259e-01  2.2807e+01 < 2.2e-16 ***
factor(state)Washington      1.9073e+01  1.8183e-01  1.0489e+02 < 2.2e-16 ***
factor(state)West Virginia   5.0000e+00  4.2022e-14  1.1899e+14 < 2.2e-16 ***
factor(state)Wisconsin       1.9994e+01  8.2447e-01  2.4251e+01 < 2.2e-16 ***
factor(state)Wyoming         1.8231e+01  8.1439e-01  2.2386e+01 < 2.2e-16 ***
treatment                    1.2108e+00  1.0180e+00  1.1894e+00 0.2349682    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

It's not nice, but gets the job done. Now cl will also work just fine and yield the same result as above:

cl(dat=charter, fm=charter.model, cluster=charter.x$state)

A better way to go about this is to use the multiwayvcov package. As per the packages's website, it is an improvement upon Arai's code:

Transparent handling of observations dropped due to missingness

Using the Petersen data with simulated NAs and cluster.vcov():

library("lmtest")
library("multiwayvcov")

data(petersen)
set.seed(123)
petersen[ sample(1:5000, 15), 3] <- NA

m1 <- lm(y ~ x, data = petersen)
summary(m1)
## 
## Call:
## lm(formula = y ~ x, data = petersen)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.759 -1.371 -0.018  1.340  8.680 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.02793    0.02842   0.983    0.326    
## x            1.03635    0.02865  36.175   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Residual standard error: 2.007 on 4983 degrees of freedom
##   (15 observations deleted due to missingness)
## Multiple R-squared:  0.208,  Adjusted R-squared:  0.2078 
## F-statistic:  1309 on 1 and 4983 DF,  p-value: < 2.2e-16

coeftest(m1, vcov=function(x) cluster.vcov(x, petersen$firmid))
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.027932   0.067198  0.4157   0.6777    
## x           1.036354   0.050700 20.4407   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

For a different approach using the plm package see:

Double clustered standard errors for panel data

Clustered Standard Errors with data containing NAs

Tags:

r

regression

standard-error

goldisfine

1 Answers

landroni

Recent Activity

Donate For Us

Clustered Standard Errors with data containing NAs

Tags:

r

regression

standard-error

goldisfine

1 Answers

landroni

Related questions

Recent Activity

Donate For Us