I have a data frame in the format:
head(subset)
# ants 0 1 1 0 1
# age 1 2 2 1 3
# lc 1 1 0 1 0
I need to create a new data frame with random samples drawn according to age and lc. For example, I want 30 samples from age:1 and lc:1, 30 samples from age:1 and lc:0, and so on.
I looked at random sampling methods such as:
newdata <- function(subset, age, 30)
But it is not the code that I want.
While not necessary for simple sampling, a sampling frame used for more advanced sampling techniques, such as stratified sampling, may contain additional information (such as demographic information).
In a stratified random sample design, the units in the sampling frame are first divided into groups, called strata, and a separate simple random sample (SRS) is taken in each stratum to form the total sample. The strata are formed to keep similar units together, for example a female stratum and a male stratum.
Stratified sampling is a sampling technique used to obtain samples that represent the population well. It reduces selection bias by dividing the population into homogeneous subgroups called strata and randomly sampling data from each stratum (the singular form of strata).
To create a stratified random sample, there are seven steps: (a) defining the population; (b) choosing the relevant stratification; (c) listing the population; (d) listing the population according to the chosen stratification; (e) choosing your sample size; (f) calculating a proportionate stratification; and (g) using ...
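As a rough illustration of the proportionate step listed above, here is a minimal base R sketch. The data frame pop, the overall size total_n, and the other variable names are made up for this example and are not from the question. Because the per-stratum sizes are rounded, the total can differ slightly from total_n.

```r
# Hypothetical population with the same two stratifying columns as the question
set.seed(42)
pop <- data.frame(age = sample(1:3, 1000, TRUE),
                  lc  = rbinom(1000, 1, 0.5))

total_n <- 100  # desired overall sample size (an assumption for this sketch)

# Split into strata and allocate the sample in proportion to stratum size
strata <- split(pop, list(pop$age, pop$lc), drop = TRUE)
sizes  <- vapply(strata, nrow, integer(1))
alloc  <- round(total_n * sizes / sum(sizes))

# Draw a simple random sample of the allocated size within each stratum
picked <- Map(function(s, k) s[sample(nrow(s), k), , drop = FALSE],
              strata, alloc)
prop_sample <- do.call(rbind, picked)

# Compare the allocation with what was actually drawn
table(prop_sample$age, prop_sample$lc)
```

If you need a fixed number per group instead (30 from each age and lc combination, as in the question), the allocation step is simply replaced by a constant.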
I would suggest using either stratified from my "splitstackshape" package, or sample_n from the "dplyr" package:
## Sample data
library(data.table)  # needed for data.table()
set.seed(1)
n <- 1e4
d <- data.table(age = sample(1:5, n, TRUE),
lc = rbinom(n, 1 , .5),
ants = rbinom(n, 1, .7))
# table(d$age, d$lc)
For stratified, you basically specify the dataset, the stratifying columns, and an integer representing the size you want from each group OR a decimal representing the fraction you want returned (for example, .1 represents 10% from each group).
library(splitstackshape)
set.seed(1)
out <- stratified(d, c("age", "lc"), 30)
head(out)
# age lc ants
# 1: 1 0 1
# 2: 1 0 0
# 3: 1 0 1
# 4: 1 0 1
# 5: 1 0 0
# 6: 1 0 1
table(out$age, out$lc)
#
# 0 1
# 1 30 30
# 2 30 30
# 3 30 30
# 4 30 30
# 5 30 30
For sample_n, you first create a grouped table (using group_by) and then specify the number of observations you want. If you wanted proportional sampling instead, you should use sample_frac.
library(dplyr)
set.seed(1)
out2 <- d %>%
group_by(age, lc) %>%
sample_n(30)
# table(out2$age, out2$lc)
Here's some data:
set.seed(1)
n <- 1e4
d <- data.frame(age = sample(1:5,n,TRUE),
lc = rbinom(n,1,.5),
ants = rbinom(n,1,.7))
You want a split-apply-combine strategy, where you split your data.frame (d in this example), sample rows/observations from each subsample, and then combine them back together with rbind. Here's how it works:
sp <- split(d, list(d$age, d$lc))
samples <- lapply(sp, function(x) x[sample(1:nrow(x), 30, FALSE),])
out <- do.call(rbind, samples)
The result:
> str(out)
'data.frame': 300 obs. of 3 variables:
$ age : int 1 1 1 1 1 1 1 1 1 1 ...
$ lc : int 0 0 0 0 0 0 0 0 0 0 ...
$ ants: int 1 1 0 1 1 1 1 1 1 1 ...
> head(out)
age lc ants
1.0.2242 1 0 1
1.0.4417 1 0 1
1.0.389 1 0 0
1.0.4578 1 0 1
1.0.8170 1 0 1
1.0.5606 1 0 1
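One caveat with the approach above: sample() raises an error if a group has fewer than 30 rows. Below is a hedged variant that takes at most 30 rows per group. The helper sample_rows and the name out_safe are made up for this sketch, and it reuses the same example data so it can be run on its own.

```r
set.seed(1)
n <- 1e4
d <- data.frame(age = sample(1:5, n, TRUE),
                lc = rbinom(n, 1, .5),
                ants = rbinom(n, 1, .7))

# Hypothetical helper: take up to `size` rows from one stratum,
# so groups smaller than `size` are returned whole instead of failing
sample_rows <- function(x, size) {
  x[sample(seq_len(nrow(x)), min(size, nrow(x))), , drop = FALSE]
}

# drop = TRUE skips age/lc combinations that do not occur in the data
sp <- split(d, list(d$age, d$lc), drop = TRUE)
out_safe <- do.call(rbind, lapply(sp, sample_rows, size = 30))

# Discard the "age.lc.row" labels that rbind builds from the list names
rownames(out_safe) <- NULL

table(out_safe$age, out_safe$lc)
```

With this example data every group has far more than 30 rows, so the result still has 30 rows per age and lc combination; the guard only matters for sparse groups.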