Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using "adjust" option for lonely PSUs in R survey package: why don't single-PSU strata variances change when data in other strata change?

Tags:

r

survey

variance

I have survey data from a stratified simple random sampling design where some strata contain only a single sampling unit (even though the strata population size may be >1). These are referred to as "lonely PSUs" in the R survey package (http://r-survey.r-forge.r-project.org/survey/exmample-lonely.html). There are several options for dealing with this situation and the one I'm interested in is the "adjust" option.

The documentation for options(survey.lonely.psu="adjust") states that

"the data for the single-PSU stratum are centered at the sample grand mean rather than the stratum mean."

In experimenting with this option I had expected that the variance for my single-PSU stratum would change if I changed the data in another stratum but that wasn't the case. Here is a small example:

library(survey)
options(survey.lonely.psu="adjust")

# sample 1
dat1 <- data.frame(N = c(3, 3, 2), h = c(1, 1, 2), y = c(2, 6, 15))
survey1 <- svydesign(~1, fpc = ~N, strata = ~h, data = dat1)
svyby(~y, by = ~h, design = survey1, FUN = svytotal)

In the results note the standard error for strata 2, which is the single-PSU stratum:

  h  y        se
1 1 12  3.464102
2 2 30 21.213203

Now if I change the data in strata 1 like so

# sample 2
(dat2 <- data.frame(N = c(3, 3, 2), h = c(1, 1, 2), y = c(200, 600, 15)))
(survey2 <- svydesign(~1, fpc = ~N, strata = ~h, data = dat2))
svyby(~y, by = ~h, design = survey2, FUN = svytotal)

The results change as expected for stratum 1, but the standard error is still the same for stratum 2

  h    y       se
1 1 1200 346.4102
2 2   30  21.2132

Am I mis-interpreting what the documentation means or might this be a bug?

FYI, here is my sessionInfo:

R version 3.1.3 (2015-03-09)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
other attached packages:
[1] survey_3.30-3

EDIT

My interpretation of the initial answers I received was that the variance adjustment doesn't take effect when the data are subset using the svyby function. However, when I compare the variances of the totals for the stratified population with the variance for the total population it appears that, just as when there are no lonely PSUs, the variance of the total is simply the variance of the independently sampled stratum variances:

> vcov(svyby(~y, by = ~h, design = survey1, FUN = svytotal))
   1   2
1 12   0
2  0 450

> vcov(svytotal(~y, survey1))
    y
y 462

It seems that the latter variance should be different if there is some sort of centering to the grand mean going on when all the data are combined.

As a related question, here is a comparison of svyby when estimating means versus totals:

> svyby(~y, by = ~h, design = survey1, FUN = svymean)
  h  y    se
1 1  4 1.155
2 2 15 0.000

> svyby(~y, by = ~h, design = survey1, FUN = svytotal)
  h  y     se
1 1 12  3.464
2 2 30 21.213

I'm confused here as to why there is a variance estimated for stratum 2 (which contains a lonley PSU) when estimating a total but not when estimating a mean.

like image 579
Bryan Avatar asked Sep 26 '22 18:09

Bryan


1 Answers

Analysis of a subset of a complex design is basically equivalent to analysis setting the sampling weights for observations outside the subset to zero. That means these observations don't enter into the 'grand mean'.

like image 197
Thomas Lumley Avatar answered Oct 11 '22 13:10

Thomas Lumley