Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Stratified random sampling from data frame

Tags:

random

r

sampling

I have a data frame in the format:

head(subset)
# ants  0 1 1 0 1 
# age   1 2 2 1 3
# lc    1 1 0 1 0

I need to create new data frame with random samples according to age and lc. For example I want 30 samples from age:1 and lc:1, 30 samples from age:1 and lc:0 etc.

I did look at random sampling method like;

newdata <- function(subset, age, 30)

But it is not the code that I want.

like image 201
user3525533 Avatar asked May 05 '14 18:05

user3525533


People also ask

Does stratified sampling need a sampling frame?

While not necessary for simple sampling, a sampling frame used for more advanced sample techniques, such as stratified sampling, may contain additional information (such as demographic information).

What is stratified sampling frame?

In a stratified random sample design, the units in the sampling frame are first divided into groups, called strata, and a separate SRS is taken in each stratum to form the total sample. The strata are formed to keep similar units together — for example, a female stratum and a male stratum.

How do you do stratified sampling in pandas?

Stratified Sampling is a sampling technique used to obtain samples that best represent the population. It reduces bias in selecting samples by dividing the population into homogeneous subgroups called strata, and randomly sampling data from each stratum(singular form of strata).

How do you create a stratified sample?

To create a stratified random sample, there are seven steps: (a) defining the population; (b) choosing the relevant stratification; (c) listing the population; (d) listing the population according to the chosen stratification; (e) choosing your sample size; (f) calculating a proportionate stratification; and (g) using ...


2 Answers

I would suggest using either stratified from my "splitstackshape" package, or sample_n from the "dplyr" package:

## Sample data
set.seed(1)
n <- 1e4
d <- data.table(age = sample(1:5, n, T), 
                lc = rbinom(n, 1 , .5),
                ants = rbinom(n, 1, .7))
# table(d$age, d$lc)

For stratified, you basically specify the dataset, the stratifying columns, and an integer representing the size you want from each group OR a decimal representing the fraction you want returned (for example, .1 represents 10% from each group).

library(splitstackshape)
set.seed(1)
out <- stratified(d, c("age", "lc"), 30)
head(out)
#    age lc ants
# 1:   1  0    1
# 2:   1  0    0
# 3:   1  0    1
# 4:   1  0    1
# 5:   1  0    0
# 6:   1  0    1

table(out$age, out$lc)
#    
#      0  1
#   1 30 30
#   2 30 30
#   3 30 30
#   4 30 30
#   5 30 30

For sample_n you first create a grouped table (using group_by) and then specify the number of observations you want. If you wanted proportional sampling instead, you should use sample_frac.

library(dplyr)
set.seed(1)
out2 <- d %>%
  group_by(age, lc) %>%
  sample_n(30)

# table(out2$age, out2$lc)
like image 192
A5C1D2H2I1M1N2O1R2T1 Avatar answered Oct 19 '22 07:10

A5C1D2H2I1M1N2O1R2T1


Here's some data:

set.seed(1)
n <- 1e4
d <- data.frame(age = sample(1:5,n,TRUE), 
                lc = rbinom(n,1,.5),
                ants = rbinom(n,1,.7))

You want a split-apply-combine strategy, where you split your data.frame (d in this example), sample rows/observations from each subsample, and then combine then back together with rbind. Here's how it works:

sp <- split(d, list(d$age, d$lc))
samples <- lapply(sp, function(x) x[sample(1:nrow(x), 30, FALSE),])
out <- do.call(rbind, samples)

The result:

> str(out)
'data.frame':   300 obs. of  3 variables:
 $ age : int  1 1 1 1 1 1 1 1 1 1 ...
 $ lc  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ants: int  1 1 0 1 1 1 1 1 1 1 ...
> head(out)
         age lc ants
1.0.2242   1  0    1
1.0.4417   1  0    1
1.0.389    1  0    0
1.0.4578   1  0    1
1.0.8170   1  0    1
1.0.5606   1  0    1
like image 17
Thomas Avatar answered Oct 19 '22 07:10

Thomas