R: Stratified random sample proportion of unique ID's by grouping variable

Tags: random, r, dplyr, sampling

With the following sample data frame, I would like to draw a stratified random sample (e.g., 40%) of the unique values of "ID" within each level of the grouping variable "Cohort":

data<-structure(list(Cohort = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), ID = structure(1:20, .Label = c("a1 ", 
"a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9", "b10", "b11", 
"b12", "b13", "b14", "b15", "b16", "b17", "b18", "b19", "b20"
), class = "factor")), .Names = c("Cohort", "ID"), class = "data.frame", row.names = c(NA, 
-20L))

I only know how to draw a fixed number of random rows per group, using the following:

library(dplyr)

data %>%
  group_by(Cohort) %>%
  sample_n(size = 10)

But my actual data are longitudinal, so each ID appears in multiple rows within its cohort, and the cohorts differ in size. That is why I need to select a proportion of the unique IDs rather than a number of rows. Any assistance would be appreciated.


asked Nov 21 '15 00:11

user3594490

1 Answer

Here's one way:

data %>% group_by(Cohort) %>%
  filter(ID %in% sample(unique(ID), ceiling(0.4*length(unique(ID)))))

This will return all rows containing the randomly sampled IDs. In other words, I'm assuming you have measurements that go with each row and that you want all the measurements for each sampled ID. (If you just want one row returned for each sampled ID then @bramtayl's answer will do that.)
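If one row per sampled ID is what you want, here is a minimal sketch of one way to get it. This is my own illustration and not necessarily what the other answer does. It assumes dplyr 1.0.0 or later (for slice_sample()), and the `toy` data frame, its `value` column, and the seed are made up for the example. Note that slice_sample(prop = ) rounds the group size down, whereas the ceiling() call above rounds up.

```r
library(dplyr)

# Hypothetical toy data shaped like the question's: 9 IDs in cohort 1,
# 11 IDs in cohort 2, each ID measured twice.
toy <- data.frame(
  Cohort = rep(c(1L, 2L), times = c(9, 11)),
  ID     = c(paste0("a", 1:9), paste0("b", 10:20))
)
toy <- rbind(toy, toy)
toy$value <- rnorm(nrow(toy))

set.seed(42)  # arbitrary seed, only so the draw is repeatable

# Reduce to one row per Cohort/ID pair, then sample 40% within each cohort.
sampled_ids <- toy %>%
  distinct(Cohort, ID) %>%
  group_by(Cohort) %>%
  slice_sample(prop = 0.4) %>%   # rounds down: 3 of 9 IDs and 4 of 11 IDs
  ungroup()

# Joining back recovers every measurement for the sampled IDs.
all_rows <- semi_join(toy, sampled_ids, by = c("Cohort", "ID"))
```

Here `sampled_ids` has one row per chosen ID, while `all_rows` has every row of `toy` belonging to those IDs, which is the same shape of result as the filter() approach.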

For example:

data = data.frame(rbind(data, data), value=rnorm(2*nrow(data)))

data %>% group_by(Cohort) %>%
  filter(ID %in% sample(unique(ID), ceiling(0.4*length(unique(ID)))))

   Cohort     ID       value
    (int) (fctr)       (dbl)
1       1    a1  -0.92370760
2       1     a2 -0.37230655
3       1     a3 -1.27037502
4       1     a7 -0.34545295
5       2    b14 -2.08205561
6       2    b17  0.31393998
7       2    b18 -0.02250819
8       2    b19  0.53065857
9       2    b20  0.03924414
10      1    a1  -0.08275011
11      1     a2 -0.10036822
12      1     a3  1.42397042
13      1     a7 -0.35203237
14      2    b14  0.30422865
15      2    b17 -1.82008014
16      2    b18  1.67548568
17      2    b19  0.74324596
18      2    b20  0.27725794
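
The group sizes in that output follow from the ceiling() call: cohort 1 has 9 unique IDs and ceiling(0.4 * 9) = 4, cohort 2 has 11 and ceiling(0.4 * 11) = 5. A quick way to check this on your own data is sketched below. The `toy` data frame and the seed are assumptions for illustration, and n_distinct(ID) is just a dplyr shorthand for length(unique(ID)).

```r
library(dplyr)

# Hypothetical toy data: 9 IDs in cohort 1, 11 in cohort 2, two rows per ID.
toy <- data.frame(
  Cohort = rep(c(1L, 2L), times = c(9, 11)),
  ID     = c(paste0("a", 1:9), paste0("b", 10:20))
)
toy <- rbind(toy, toy)
toy$value <- rnorm(nrow(toy))

set.seed(1)  # arbitrary seed so the same IDs are drawn on every run

# Same technique as above: keep all rows whose ID is in the per-cohort sample.
res <- toy %>%
  group_by(Cohort) %>%
  filter(ID %in% sample(unique(ID), ceiling(0.4 * n_distinct(ID)))) %>%
  ungroup()

# Count sampled IDs and returned rows per cohort.
counts <- res %>%
  group_by(Cohort) %>%
  summarise(n_ids = n_distinct(ID), n_rows = n())
```

With two rows per ID, `counts` should show 4 IDs and 8 rows for cohort 1, and 5 IDs and 10 rows for cohort 2. One caveat if your IDs are numeric rather than factor or character: sample(x, k) treats a single number x >= 1 as 1:x, so a cohort with only one numeric ID could behave unexpectedly.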


answered Sep 28 '22 18:09

eipi10
