
How to use sample weights in GAM (mgcv) on survey data for Logit regression?

I'm interested in fitting a GAM regression to data from a nationwide survey that provides sample weights. I read this post with interest. I selected my variables of interest to build a data frame:

library(dplyr)   # provides %>% and select()
nhanesAnalysis <- nhanesDemo %>%
                    select(fpl,
                           age,
                           gender,
                           persWeight,
                           psu,
                           strata)

Then, as far as I understood, I created a survey design object carrying the weights with the following code:

library(survey)    
nhanesDesign <- svydesign(    id      = ~psu,
                              strata  = ~strata,
                              weights = ~persWeight,
                              nest    = TRUE,
                              data    = nhanesAnalysis)

Let's say that I want to keep only subjects with age ≥ 30:

ageDesign <- subset(nhanesDesign, age >= 30)

Now I would like to fit a GAM (fpl ~ s(age) + gender) with the mgcv package. Is it possible to do so with the weights argument, or by using the svydesign object ageDesign?

EDIT

I was wondering whether it is correct to extract the computed weights from an svyglm object and pass them to the weights argument of gam().

Borexino asked May 26 '19 13:05



1 Answer

This is more difficult than it looks. There are two issues:

  1. You want to get the right amount of smoothing
  2. You want valid standard errors.

Just giving the sampling weights to mgcv::gam() won't do either of these: gam() treats the weights as frequency weights and so will think it has a lot more data than it actually has. You will get undersmoothing and underestimated standard errors because of the weights, and you will also likely get underestimated standard errors because of the cluster sampling.
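
The frequency-weight effect on standard errors can be seen in a small sketch. Everything here is an assumption made for illustration: the data are simulated, the variable names are invented, and base R glm() with a binomial family stands in for gam(), on the understanding that both treat prior weights as multipliers on the log-likelihood. It only demonstrates the standard-error part of the problem, not the smoothing part.

```r
# Illustration only: simulated data, base R glm() rather than mgcv::gam().
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + x))

# Same data, once with unit weights and once with every weight set to 5
fit_1 <- glm(y ~ x, family = binomial, weights = rep(1, n))
fit_5 <- glm(y ~ x, family = binomial, weights = rep(5, n))

se_1 <- sqrt(diag(vcov(fit_1)))
se_5 <- sqrt(diag(vcov(fit_5)))

# The point estimates agree, but the standard errors shrink by about
# 1/sqrt(5), as if there were five times as many independent observations.
print(cbind(coef_1 = coef(fit_1), coef_5 = coef(fit_5)))
print(se_5 / se_1)
```

Sampling weights are typically much larger than 5, so the apparent gain in precision from feeding them in directly would be correspondingly larger.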

The simple work-around is to use regression splines (splines package) instead. These aren't quite as good as the penalised splines used by mgcv, but the difference usually isn't a big deal, and they work straightforwardly with svyglm. You do need to choose how many degrees of freedom to assign.

library(splines)
svyglm(fpl ~ ns(age, 4) + gender, design = nhanesDesign)
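
For anyone without the NHANES extract at hand, the same approach can be tried on the api example data that ships with the survey package. This is a sketch under assumptions: the dataset, the outcome api00 and the predictors meals and stype are substitutes chosen for illustration and are not the variables from the question, and 4 degrees of freedom is an arbitrary choice.

```r
library(survey)
library(splines)

# Example data shipped with the survey package (not NHANES)
data(api)

# One-stage cluster sample: districts (dnum) are the PSUs, pw the sampling weights
dclus1 <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)

# Natural cubic spline in meals with 4 degrees of freedom, plus school type
fit <- svyglm(api00 ~ ns(meals, 4) + stype, design = dclus1)

# Design-based standard errors, which account for weights and clustering
print(cbind(estimate = coef(fit), se = sqrt(diag(vcov(fit)))))
```

For a binary outcome, svyglm also accepts a family argument; the survey documentation uses family = quasibinomial() for logistic models.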
Thomas Lumley answered Sep 21 '22 19:09