I have a population of 6 categories (stratum) and I want in each stratum to take the 10% as a sample. Doing so I take:
var = c(rep("A",10),rep("B",10),rep("C",3),rep("D",5),"E","F");var
value = rnorm(30)
dat = tibble(var,value);
pop=dat%>%group_by(var)
pop
singleallocperce = slice_sample(pop, prop=0.1);
singleallocperce
with result:
# A tibble: 2 x 2
# Groups: var [2]
var value
<chr> <dbl>
1 A -1.54
2 B -1.12
But I want even if in some stratum that the polupation inside them cannot reach the taken sample of 10% to take at least one observation.How can I do it this in R using dplyr package?
Additional
Additionally if I want to make proportional allocation sampling (ie with weight proportional to the subpopulation of each stratum fro example for A the weight will be : 10/30,for B: 10/30,for C:3/30,D:5/30 etc ) keeping the constrain of 1 observation if the subpopulation does not meet that requirement ?
Possible approach (note the presence of 20 x A to check two are returned).
library(tidyverse)
# Data (note 20 As)
var = c(rep("A",20),rep("B",10),rep("C",3),rep("D",5),"E","F")
value = rnorm(40)
dat = tibble(var, value)
# Possible approach
dat %>%
group_by(var) %>%
mutate(min = if_else(n() * 0.1 >= 1, n() * 0.1, 1),
random = sample(n())) %>%
filter(random <= min) |>
select(var, value)
#> # A tibble: 7 × 2
#> # Groups: var [6]
#> var value
#> <chr> <dbl>
#> 1 A 0.0105
#> 2 A 0.171
#> 3 B -1.89
#> 4 C 1.89
#> 5 D 0.612
#> 6 E 0.516
#> 7 F 0.185
Created on 2022-06-02 by the reprex package (v2.0.1)
Weighted version:
library(tidyverse)
# Data (note 20 As)
var = c(rep("A",20),rep("B",10),rep("C",3),rep("D",5),"E","F")
value = rnorm(40)
dat = tibble(var, value)
# Possible approach
dat %>%
add_count(name = "n_all") %>%
group_by(var) %>%
mutate(
weight = n() / n_all,
min = if_else(n() * weight >= 1, n() * weight, 1),
random = sample(n())
) %>%
filter(random <= min) |>
select(var, value)
#> # A tibble: 16 × 2
#> # Groups: var [6]
#> var value
#> <chr> <dbl>
#> 1 A 0.339
#> 2 A 1.77
#> 3 A -0.145
#> 4 A -0.915
#> 5 A 0.146
#> 6 A 0.896
#> 7 A -0.407
#> 8 A -1.30
#> 9 A 1.22
#> 10 A 0.0527
#> 11 B -0.602
#> 12 B -0.432
#> 13 C -0.0540
#> 14 D -1.45
#> 15 E 1.54
#> 16 F 0.879
Created on 2022-06-09 by the reprex package (v2.0.1)
Here is a potential solution:
sample_func <- function(data) {
standard <- data %>%
group_by(var) %>%
slice_sample(prop = 0.1) %>%
ungroup()
if(!all(unique(data$var) %in% unique(standard$var))) {
mins <- data %>%
filter(!var %in% standard$var) %>%
group_by(var) %>%
slice(1) %>%
ungroup()
}
bind_rows(standard, mins)
}
sample_func(dat)
Which gives:
var value
<chr> <dbl>
1 A 1.36
2 B -1.03
3 C -0.0450
4 D -0.380
5 E -0.0556
6 F 0.519
The assumption is that if you are sampling proportionally and do not have any sample for var, that the minimum threshold would be sampling one record from var (by using slice(1)).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With