Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sample a proportion of each group, BUT with a minimum constraint (using dplyr)

Tags:

r

dplyr

sample

I have a population of 6 categories (stratum) and I want in each stratum to take the 10% as a sample. Doing so I take:

var = c(rep("A",10),rep("B",10),rep("C",3),rep("D",5),"E","F");var
value = rnorm(30)
dat = tibble(var,value);
pop=dat%>%group_by(var)
pop
singleallocperce = slice_sample(pop, prop=0.1);
singleallocperce

with result:

# A tibble: 2 x 2
# Groups:   var [2]
  var   value
  <chr> <dbl>
1 A     -1.54
2 B     -1.12

But I want even if in some stratum that the polupation inside them cannot reach the taken sample of 10% to take at least one observation.How can I do it this in R using dplyr package?

Additional

Additionally if I want to make proportional allocation sampling (ie with weight proportional to the subpopulation of each stratum fro example for A the weight will be : 10/30,for B: 10/30,for C:3/30,D:5/30 etc ) keeping the constrain of 1 observation if the subpopulation does not meet that requirement ?

like image 937
Homer Jay Simpson Avatar asked Oct 18 '25 02:10

Homer Jay Simpson


2 Answers

Possible approach (note the presence of 20 x A to check two are returned).

library(tidyverse)

# Data (note 20 As)
var = c(rep("A",20),rep("B",10),rep("C",3),rep("D",5),"E","F")
value = rnorm(40)
dat = tibble(var, value)

# Possible approach
dat %>%
  group_by(var) %>%
  mutate(min = if_else(n() * 0.1 >= 1, n() * 0.1, 1),
         random = sample(n())) %>%
  filter(random <= min) |> 
  select(var, value)
#> # A tibble: 7 × 2
#> # Groups:   var [6]
#>   var     value
#>   <chr>   <dbl>
#> 1 A      0.0105
#> 2 A      0.171 
#> 3 B     -1.89  
#> 4 C      1.89  
#> 5 D      0.612 
#> 6 E      0.516 
#> 7 F      0.185

Created on 2022-06-02 by the reprex package (v2.0.1)

Weighted version:

library(tidyverse)

# Data (note 20 As)
var = c(rep("A",20),rep("B",10),rep("C",3),rep("D",5),"E","F")
value = rnorm(40)
dat = tibble(var, value)

# Possible approach
dat %>%
  add_count(name = "n_all") %>%
  group_by(var) %>%
  mutate(
    weight = n() / n_all,
    min = if_else(n() * weight >= 1, n() * weight, 1),
    random = sample(n())
  ) %>%
  filter(random <= min) |>
  select(var, value)
#> # A tibble: 16 × 2
#> # Groups:   var [6]
#>    var     value
#>    <chr>   <dbl>
#>  1 A      0.339 
#>  2 A      1.77  
#>  3 A     -0.145 
#>  4 A     -0.915 
#>  5 A      0.146 
#>  6 A      0.896 
#>  7 A     -0.407 
#>  8 A     -1.30  
#>  9 A      1.22  
#> 10 A      0.0527
#> 11 B     -0.602 
#> 12 B     -0.432 
#> 13 C     -0.0540
#> 14 D     -1.45  
#> 15 E      1.54  
#> 16 F      0.879

Created on 2022-06-09 by the reprex package (v2.0.1)

like image 125
Carl Avatar answered Oct 20 '25 16:10

Carl


Here is a potential solution:

sample_func <- function(data) {
  standard <- data %>% 
    group_by(var) %>% 
    slice_sample(prop = 0.1) %>% 
    ungroup()
  
  if(!all(unique(data$var) %in% unique(standard$var))) {
    mins <- data %>% 
      filter(!var %in% standard$var) %>% 
      group_by(var) %>% 
      slice(1) %>% 
      ungroup()
  }
  
  bind_rows(standard, mins) 
  
}

sample_func(dat)

Which gives:

  var     value
  <chr>   <dbl>
1 A      1.36  
2 B     -1.03  
3 C     -0.0450
4 D     -0.380 
5 E     -0.0556
6 F      0.519 

The assumption is that if you are sampling proportionally and do not have any sample for var, that the minimum threshold would be sampling one record from var (by using slice(1)).

like image 25
Matt Avatar answered Oct 20 '25 16:10

Matt



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!