I want to sample rows from a data frame using unequal sample sizes from each group.
Let's say we have a simple data frame grouped by 'group':
library(dplyr)
set.seed(123)
# rnorm(10) gives 10 values, which data.frame() recycles across both groups
df <- data.frame(group = rep(c("A", "B"), each = 10),
                 value = rnorm(10))
df
#> group value
#> 1 A -0.56047565
#> 2 A -0.23017749
#> .....
#> 10 A -0.44566197
#> 11 B -0.56047565
#> 12 B -0.23017749
#> .....
#> 20 B -0.44566197
Using the slice_sample function from the dplyr package, you can easily sample the same number of rows from each group of this data frame:
df %>% group_by(group) %>% slice_sample(n = 2) %>% ungroup()
#> # A tibble: 4 x 2
#> group value
#> <fct> <dbl>
#> 1 A -0.687
#> 2 A -0.446
#> 3 B -0.687
#> 4 B 1.56
Question
How do you sample a different number of values from each group (slice groups that are not equal in size)? For example, sample 4 rows from group A, and 5 rows from group B?
The easiest thing I can think of is a map2 solution using purrr.
library(dplyr)
library(purrr)
df %>%
  group_split(group) %>%
  map2_dfr(c(4, 5), ~ slice_sample(.x, n = .y))
#> # A tibble: 9 x 2
#>   group   value
#>   <chr>   <dbl>
#> 1 A     -0.687
#> 2 A      1.56
#> 3 A      0.0705
#> 4 A      1.72
#> 5 B     -0.560
#> 6 B      0.461
#> 7 B      0.129
#> 8 B      0.0705
#> 9 B     -0.230
One caution: the sizes are matched to the groups by position, so you need to know the order in which the split returns them. I think group_split() returns the pieces in sorted group order (level order for factors). A way around that is to split by name and look up n from a named vector:
group_slice_n <- c(A = 4, B = 5)
df %>%
  split(.$group) %>%
  # .y is the name of each piece; [[ ]] returns the bare size for that group
  imap_dfr(~ slice_sample(.x, n = group_slice_n[[.y]]))
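The ordering concern above can be checked directly before pairing sizes with groups by position. This is a minimal sketch on a made-up data frame (`toy` is hypothetical and not part of the question), assuming dplyr's `group_keys()` and `group_split()` behave as they do in current releases:

```r
library(dplyr)

# Hypothetical toy data: the rows for group "B" come first
toy <- data.frame(group = rep(c("B", "A"), each = 3), value = 1:6)

# group_keys() lists one row per group, in the order dplyr stores the groups
keys <- toy %>% group_by(group) %>% group_keys()
keys$group

# group_split() returns its pieces in that same order
pieces <- toy %>% group_split(group)
piece_names <- vapply(pieces, function(p) unique(p$group), character(1))
piece_names
```

With this toy data both calls should list A before B even though B appears first in the rows, which is why a positional vector such as c(4, 5) has to follow the sorted group order rather than the row order.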
Try this:
group_sizes <- tibble(group = c("A", "B"), size = c(4, 5))
set.seed(2021)
df %>%
  left_join(group_sizes, by = "group") %>%
  group_by(group) %>%
  mutate(samp = sample(n())) %>%
  filter(samp <= size) %>%
  ungroup()
# # A tibble: 9 x 4
# group value size samp
# <chr> <dbl> <dbl> <int>
# 1 A 0.0705 4 2
# 2 A 0.129 4 4
# 3 A -0.687 4 1
# 4 A -0.446 4 3
# 5 B -0.560 5 5
# 6 B 1.56 5 1
# 7 B 0.129 5 4
# 8 B 1.72 5 3
# 9 B -1.27 5 2
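The join approach leaves the helper columns size and samp in the result. As a small sketch, the same pipeline can drop them and confirm the per-group counts; the `select()` and `count()` steps and the name `sampled` are additions for illustration and not part of the answer above:

```r
library(dplyr)

set.seed(123)
df <- data.frame(group = rep(c("A", "B"), each = 10),
                 value = rnorm(10))
group_sizes <- tibble(group = c("A", "B"), size = c(4, 5))

set.seed(2021)
sampled <- df %>%
  left_join(group_sizes, by = "group") %>%
  group_by(group) %>%
  mutate(samp = sample(n())) %>%   # random permutation of 1..n within each group
  filter(samp <= size) %>%         # keep the first `size` positions of the permutation
  ungroup() %>%
  select(-size, -samp)             # drop the helper columns

# One row per group with the number of sampled rows: 4 for A, 5 for B
count(sampled, group)
```

Note that a group missing from group_sizes gets size = NA after the join, and filter() drops rows where the condition is NA, so such a group disappears from the result without a warning.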