I have survey data from a stratified simple random sampling design where some strata contain only a single sampling unit (even though the strata population size may be >1). These are referred to as "lonely PSUs" in the R survey package (http://r-survey.r-forge.r-project.org/survey/exmample-lonely.html). There are several options for dealing with this situation and the one I'm interested in is the "adjust" option.
The documentation for options(survey.lonely.psu="adjust")
states that
"the data for the single-PSU stratum are centered at the sample grand mean rather than the stratum mean."
In experimenting with this option I had expected that the variance for my single-PSU stratum would change if I changed the data in another stratum but that wasn't the case. Here is a small example:
library(survey)
options(survey.lonely.psu="adjust")
# sample 1
dat1 <- data.frame(N = c(3, 3, 2), h = c(1, 1, 2), y = c(2, 6, 15))
survey1 <- svydesign(~1, fpc = ~N, strata = ~h, data = dat1)
svyby(~y, by = ~h, design = survey1, FUN = svytotal)
In the results note the standard error for strata 2, which is the single-PSU stratum:
h y se
1 1 12 3.464102
2 2 30 21.213203
Now if I change the data in strata 1 like so
# sample 2
(dat2 <- data.frame(N = c(3, 3, 2), h = c(1, 1, 2), y = c(200, 600, 15)))
(survey2 <- svydesign(~1, fpc = ~N, strata = ~h, data = dat2))
svyby(~y, by = ~h, design = survey2, FUN = svytotal)
The results change as expected for stratum 1, but the standard error is still the same for stratum 2
h y se
1 1 1200 346.4102
2 2 30 21.2132
Am I mis-interpreting what the documentation means or might this be a bug?
FYI, here is my sessionInfo:
R version 3.1.3 (2015-03-09)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
other attached packages:
[1] survey_3.30-3
EDIT
My interpretation of the initial answers I received was that the variance adjustment doesn't take effect when the data are subset using the svyby
function. However, when I compare the variances of the totals for the stratified population with the variance for the total population it appears that, just as when there are no lonely PSUs, the variance of the total is simply the variance of the independently sampled stratum variances:
> vcov(svyby(~y, by = ~h, design = survey1, FUN = svytotal))
1 2
1 12 0
2 0 450
> vcov(svytotal(~y, survey1))
y
y 462
It seems that the latter variance should be different if there is some sort of centering to the grand mean going on when all the data are combined.
As a related question, here is a comparison of svyby
when estimating means versus totals:
> svyby(~y, by = ~h, design = survey1, FUN = svymean)
h y se
1 1 4 1.155
2 2 15 0.000
> svyby(~y, by = ~h, design = survey1, FUN = svytotal)
h y se
1 1 12 3.464
2 2 30 21.213
I'm confused here as to why there is a variance estimated for stratum 2 (which contains a lonley PSU) when estimating a total but not when estimating a mean.
Analysis of a subset of a complex design is basically equivalent to analysis setting the sampling weights for observations outside the subset to zero. That means these observations don't enter into the 'grand mean'.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With