Stratified random sampling from data frame

I have a data frame in the format:

head(subset)
# ants  0 1 1 0 1 
# age   1 2 2 1 3
# lc    1 1 0 1 0

I need to create a new data frame of random samples stratified by age and lc. For example, I want 30 samples from age:1 and lc:1, 30 samples from age:1 and lc:0, and so on.

I looked at random sampling approaches such as:

newdata <- function(subset, age, 30)

But this does not do what I want.

asked May 05 '14 18:05

user3525533

2 Answers

I would suggest using either stratified from my "splitstackshape" package, or sample_n from the "dplyr" package:

## Sample data
library(data.table)  # needed for data.table()
set.seed(1)
n <- 1e4
d <- data.table(age = sample(1:5, n, TRUE), 
                lc = rbinom(n, 1, .5),
                ants = rbinom(n, 1, .7))
# table(d$age, d$lc)

For stratified, you basically specify the dataset, the stratifying columns, and an integer representing the size you want from each group OR a decimal representing the fraction you want returned (for example, .1 represents 10% from each group).

library(splitstackshape)
set.seed(1)
out <- stratified(d, c("age", "lc"), 30)
head(out)
#    age lc ants
# 1:   1  0    1
# 2:   1  0    0
# 3:   1  0    1
# 4:   1  0    1
# 5:   1  0    0
# 6:   1  0    1

table(out$age, out$lc)
#    
#      0  1
#   1 30 30
#   2 30 30
#   3 30 30
#   4 30 30
#   5 30 30
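
The fractional form of the size argument described above can be sketched as follows. This is a minimal example, assuming the "splitstackshape" package is installed; the data are rebuilt as a plain data.frame so the block runs on its own, and the name out_frac is only illustrative.

```r
library(splitstackshape)

# Rebuild the simulated data so this block is self-contained
set.seed(1)
n <- 1e4
d <- data.frame(age = sample(1:5, n, TRUE),
                lc = rbinom(n, 1, .5),
                ants = rbinom(n, 1, .7))

# A size below 1 is treated as a proportion: here about 10% of each age/lc group
set.seed(1)
out_frac <- stratified(d, c("age", "lc"), 0.1)

# Group sizes now vary with the size of each stratum
table(out_frac$age, out_frac$lc)
```

Because each stratum contributes in proportion to its size, the result keeps roughly the same age/lc mix as the full data, unlike the fixed 30-per-group sample.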

For sample_n you first create a grouped table (using group_by) and then specify the number of observations you want. If you wanted proportional sampling instead, you should use sample_frac.

library(dplyr)
set.seed(1)
out2 <- d %>%
  group_by(age, lc) %>%
  sample_n(30)

# table(out2$age, out2$lc)
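
In dplyr 1.0.0 and later, sample_n and sample_frac are marked as superseded in favour of slice_sample. The following is a minimal sketch of the same grouped sampling with that function, assuming dplyr >= 1.0.0 is installed; the data are rebuilt so the block runs on its own, and out3 and out4 are illustrative names.

```r
library(dplyr)

# Rebuild the simulated data so this block is self-contained
set.seed(1)
n <- 1e4
d <- data.frame(age = sample(1:5, n, TRUE),
                lc = rbinom(n, 1, .5),
                ants = rbinom(n, 1, .7))

set.seed(1)
# Fixed number of rows from each age/lc group
out3 <- d %>%
  group_by(age, lc) %>%
  slice_sample(n = 30) %>%
  ungroup()

# Proportional alternative: about 10% of each group
out4 <- d %>%
  group_by(age, lc) %>%
  slice_sample(prop = 0.1) %>%
  ungroup()

table(out3$age, out3$lc)
```

The ungroup() call at the end is optional; it just returns an ordinary tibble so later operations are not silently applied per group.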

answered Oct 19 '22 07:10

A5C1D2H2I1M1N2O1R2T1

Here's some data:

set.seed(1)
n <- 1e4
d <- data.frame(age = sample(1:5,n,TRUE), 
                lc = rbinom(n,1,.5),
                ants = rbinom(n,1,.7))

You want a split-apply-combine strategy: split your data.frame (d in this example) by the stratifying variables, sample rows/observations from each subset, and then combine them back together with rbind. Here's how it works:

sp <- split(d, list(d$age, d$lc))
samples <- lapply(sp, function(x) x[sample(1:nrow(x), 30, FALSE),])
out <- do.call(rbind, samples)

The result:

> str(out)
'data.frame':   300 obs. of  3 variables:
 $ age : int  1 1 1 1 1 1 1 1 1 1 ...
 $ lc  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ants: int  1 1 0 1 1 1 1 1 1 1 ...
> head(out)
         age lc ants
1.0.2242   1  0    1
1.0.4417   1  0    1
1.0.389    1  0    0
1.0.4578   1  0    1
1.0.8170   1  0    1
1.0.5606   1  0    1
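
One caveat with the split-apply-combine code above: sample() stops with an error if any group has fewer than 30 rows. A hedged base R variant that caps the sample at the group size is sketched below; sample_group and out_safe are hypothetical names introduced only for this illustration.

```r
# Rebuild the simulated data so this block is self-contained
set.seed(1)
n <- 1e4
d <- data.frame(age = sample(1:5, n, TRUE),
                lc = rbinom(n, 1, .5),
                ants = rbinom(n, 1, .7))

# Hypothetical helper: draw up to `size` rows from one group without replacement
sample_group <- function(x, size = 30) {
  k <- min(size, nrow(x))                       # never ask for more rows than exist
  x[sample(seq_len(nrow(x)), k), , drop = FALSE]
}

# drop = TRUE skips age/lc combinations that do not occur in the data
sp <- split(d, list(d$age, d$lc), drop = TRUE)
out_safe <- do.call(rbind, lapply(sp, sample_group))
rownames(out_safe) <- NULL                      # discard the "1.0.2242"-style row names

table(out_safe$age, out_safe$lc)
```

With the simulated data every group has far more than 30 rows, so this returns 30 rows per group just like the original code; the cap only matters for sparse strata.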

answered Oct 19 '22 07:10

Thomas

Tags: random, r, sampling
