
R: Stratified random sample of a proportion of unique IDs by grouping variable

Given the following sample data frame, I would like to draw a stratified random sample (e.g., 40%) of the unique values of "ID" within each level of "Cohort":

data<-structure(list(Cohort = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), ID = structure(1:20, .Label = c("a1 ", 
"a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9", "b10", "b11", 
"b12", "b13", "b14", "b15", "b16", "b17", "b18", "b19", "b20"
), class = "factor")), .Names = c("Cohort", "ID"), class = "data.frame", row.names = c(NA, 
-20L))

I only know how to draw a fixed number of random rows per group, using the following:

library(dplyr)
data %>% 
group_by(Cohort) %>%
sample_n(size = 10)

But my actual data are longitudinal, so the same ID appears in multiple rows within each cohort, and there are several cohorts of different sizes. Hence the need to select a proportion of unique IDs rather than a number of rows. Any assistance would be appreciated.

Asked by user3594490 on Nov 21 '15 at 00:11


1 Answer

Here's one way:

data %>% group_by(Cohort) %>%
  filter(ID %in% sample(unique(ID), ceiling(0.4*length(unique(ID)))))

This will return all rows containing the randomly sampled IDs. In other words, I'm assuming you have measurements that go with each row and that you want all the measurements for each sampled ID. (If you just want one row returned for each sampled ID then @bramtayl's answer will do that.)
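A variation on the same idea is to sample from a table of distinct (Cohort, ID) pairs first and then join back to the full data. This is only a sketch under assumptions: the `long` data frame, its `value` column, and the seed are made up for illustration and are not part of the question's data. It yields both forms discussed above, one row per sampled ID (`sampled_ids`) and every measurement for those IDs (`sampled_rows`).

```r
library(dplyr)

set.seed(42)  # assumed seed, only so the draw is repeatable

# Toy longitudinal data (hypothetical): two rows per ID,
# 9 IDs in cohort 1 and 11 IDs in cohort 2
long <- data.frame(
  Cohort = rep(c(1L, 2L), times = c(18, 22)),
  ID     = rep(c(paste0("a", 1:9), paste0("b", 10:20)), each = 2),
  value  = rnorm(40),
  stringsAsFactors = FALSE
)

# One row per sampled ID: 40% of the unique IDs in each cohort, rounded up
sampled_ids <- long %>%
  distinct(Cohort, ID) %>%
  group_by(Cohort) %>%
  filter(ID %in% sample(ID, ceiling(0.4 * n()))) %>%
  ungroup()

# All measurements that belong to the sampled IDs
sampled_rows <- semi_join(long, sampled_ids, by = c("Cohort", "ID"))
```

Keeping the sampled IDs in their own table makes it possible to reuse exactly the same draw later, for example to filter a second data set with the same participants.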

For example:

data = data.frame(rbind(data, data), value=rnorm(2*nrow(data)))

data %>% group_by(Cohort) %>%
  filter(ID %in% sample(unique(ID), ceiling(0.4*length(unique(ID)))))

   Cohort     ID       value
    (int) (fctr)       (dbl)
1       1    a1  -0.92370760
2       1     a2 -0.37230655
3       1     a3 -1.27037502
4       1     a7 -0.34545295
5       2    b14 -2.08205561
6       2    b17  0.31393998
7       2    b18 -0.02250819
8       2    b19  0.53065857
9       2    b20  0.03924414
10      1    a1  -0.08275011
11      1     a2 -0.10036822
12      1     a3  1.42397042
13      1     a7 -0.35203237
14      2    b14  0.30422865
15      2    b17 -1.82008014
16      2    b18  1.67548568
17      2    b19  0.74324596
18      2    b20  0.27725794
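The output above contains 4 of the 9 IDs in cohort 1 and 5 of the 11 IDs in cohort 2, because `ceiling()` rounds 3.6 and 4.4 up. The same sampling can be written without dplyr; the following is a minimal base R sketch. The helper name `sample_ids_by_group` and the `toy` data are hypothetical. It uses `sample.int()` on positions because `sample(x, 1)` treats a single number `x` as the range `1:x`, which matters if the IDs are numeric and a group has only one of them.

```r
# Hypothetical helper (not from the answer): keep every row belonging to a
# random `prop` share of the unique IDs within each level of `group`.
sample_ids_by_group <- function(df, group, id, prop) {
  pieces <- lapply(split(df, df[[group]]), function(d) {
    ids <- unique(d[[id]])
    n_keep <- ceiling(prop * length(ids))
    # sample.int() draws positions, so a group with a single numeric ID
    # is not misread as "sample from 1:ID" the way sample(ids, 1) would be
    chosen <- ids[sample.int(length(ids), n_keep)]
    d[d[[id]] %in% chosen, , drop = FALSE]
  })
  do.call(rbind, pieces)
}

set.seed(1)  # assumed seed, only for repeatability
toy <- data.frame(
  Cohort = rep(c(1L, 2L), times = c(18, 22)),
  ID     = rep(1:20, each = 2),  # numeric IDs, two rows each
  value  = rnorm(40)
)
out <- sample_ids_by_group(toy, "Cohort", "ID", 0.4)

# Number of distinct sampled IDs per cohort
tapply(out$ID, out$Cohort, function(x) length(unique(x)))
```

Note that the result is ordered by group rather than by the original row order, since the pieces are bound back together one cohort at a time.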

Answered by eipi10 on Sep 28 '22 at 18:09