
How to use rep_slice_sample() to randomly sample within groups of varying observation number

I am building confidence intervals for groups with bootstrapped values and I'm having trouble creating multiple re-sampled datasets from which to build my confidence intervals.

Using the palmerpenguins library as an example:

library(tidyverse)
library(infer)
library(palmerpenguins)

There are 344 total observations and each species has a different number of observations:

nrow(penguins)
[1] 344

penguins %>% group_by(species) %>% count()

# A tibble: 3 × 2
# Groups:   species [3]
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124

I want to group by species and, for each species, draw multiple samples that each keep that group's original number of observations.

set.seed(100)

slices <- penguins %>% 
    group_by(species) %>% 
    rep_slice_sample(prop = 1, replace = TRUE, reps = 10)

That should give me 344 * 10 = 3440 rows in the new data set, and it does. But within each replicate the number of observations per species varies. Every Adelie sample should have n = 152, every Chinstrap sample 68, and every Gentoo sample 124. Instead we find this:

slices %>% group_by(species, replicate) %>% count()

# A tibble: 30 × 3
# Groups:   species, replicate [30]
   species replicate     n
   <fct>       <int> <int>
 1 Adelie          1   148
 2 Adelie          2   147
 3 Adelie          3   148
 4 Adelie          4   151
 5 Adelie          5   138
 6 Adelie          6   157
 7 Adelie          7   161
 8 Adelie          8   157
 9 Adelie          9   151
10 Adelie         10   138
# ℹ 20 more rows
# ℹ Use `print(n = ...)` to see more rows

What am I missing?

asked Aug 30 '25 16:08 by Adrienne B


1 Answer

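Another option is dplyr's slice_sample(), run once per replicate with by = species so that each species is resampled at its own size. (A single call, penguins %>% slice_sample(prop = 10, replace = TRUE, by = species), also returns ten times each group's size, but it doesn't provide the replicate number.) The per-replicate version: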

library(tidyverse)
library(palmerpenguins)

set.seed(100)

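# Draw one bootstrap sample per species (prop = 1 keeps each group's size),
# tag it with its replicate number, and repeat 10 times.
slices <- map(1:10, \(x) {
  penguins %>% 
    slice_sample(prop = 1, replace = TRUE, by = species) %>% 
    mutate(replicate = x)
}) %>% 
  bind_rows()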

slices %>% count(species, replicate)
#> # A tibble: 30 × 3
#>    species replicate     n
#>    <fct>       <int> <int>
#>  1 Adelie          1   152
#>  2 Adelie          2   152
#>  3 Adelie          3   152
#>  4 Adelie          4   152
#>  5 Adelie          5   152
#>  6 Adelie          6   152
#>  7 Adelie          7   152
#>  8 Adelie          8   152
#>  9 Adelie          9   152
#> 10 Adelie         10   152
#> # ℹ 20 more rows

Created on 2024-03-17 with reprex v2.1.0
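
Since the goal in the question is confidence intervals, here is a minimal sketch of how the same per-species resampling could feed a percentile bootstrap interval. Several things here are assumptions rather than anything stated in the question: body_mass_g is picked purely as an example variable, the mean is an example statistic, 1000 replicates replaces the 10 used above, and the percentile method is only one common way to form the interval. The names n_reps, boot_means and ci are illustrative. The by and .by arguments need dplyr 1.1.0 or later.

```r
library(tidyverse)
library(palmerpenguins)

set.seed(100)

# Assumption: 1000 replicates; 10 is too few for stable interval endpoints.
n_reps <- 1000

# For each replicate, resample every species at its original size
# and keep only the statistic of interest (here the mean body mass).
boot_means <- map(seq_len(n_reps), \(x) {
  penguins |>
    slice_sample(prop = 1, replace = TRUE, by = species) |>
    summarise(mean_mass = mean(body_mass_g, na.rm = TRUE), .by = species) |>
    mutate(replicate = x)
}) |>
  bind_rows()

# Percentile interval: the 2.5% and 97.5% quantiles of the
# bootstrap means within each species.
ci <- boot_means |>
  summarise(
    lower = quantile(mean_mass, 0.025, names = FALSE),
    upper = quantile(mean_mass, 0.975, names = FALSE),
    .by = species
  )

ci
```

Summarising inside the loop keeps only one row per species per replicate, so the full resampled data sets never need to be held in memory at once.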

answered Sep 03 '25 09:09 by Carl