I have a dataframe which contains multiple samples (1-n) per group. I would like to sample this dataset, without replacement, so that I have a maximum of 5 samples per group (1-5).
This problem has previously been described and answered here. In that question, @evolvedmicrobe's answer was the most satisfactory for me and has worked in the past, but it seems to have broken in the last year or so.
Here is a workable example of what I would like to do:
From mtcars, there are different numbers of rows when grouped by "cyl".
table(mtcars$cyl)
 4  6  8
11  7 14
I would like to create a sub-sample where the maximum number of cars per group cyl is ten. The resulting number of rows would theoretically look like:
table(subsample$cyl)
 4  6  8
10  7 10
My naive attempt at this was:
library(dplyr)
subsample <- mtcars %>% group_by(cyl) %>% sample_n(10) %>% ungroup()
However, this fails because one group has fewer than 10 rows:
Error: `size` must be less or equal than 7 (size of data), set `replace` = TRUE to use sampling with replacement
@evolvedmicrobe's answer to this was to create a custom sampling function:
### Custom sampler function to sample min(data, sample) which can't be done with dplyr
### it's a modified copy of sample_n.grouped_df
sample_vals <- function (tbl, size, replace = FALSE, weight = NULL, .env = parent.frame())
{
    #assert_that(is.numeric(size), length(size) == 1, size >= 0)
    weight <- substitute(weight)
    index <- attr(tbl, "indices")
    sizes = sapply(index, function(z) min(length(z), size)) # here's my contribution
    sampled <- lapply(1:length(index), function(i) dplyr:::sample_group(index[[i]], frac = FALSE, tbl = tbl,
        size = sizes[i], replace = replace, weight = weight, .env = .env))
    idx <- unlist(sampled) + 1
    grouped_df(tbl[idx, , drop = FALSE], vars = groups(tbl))
}
samped_data = dataset %>% group_by(something) %>% sample_vals(size = 50000) %>% ungroup()
This function has worked in the past, but when I re-ran it recently it failed. Applied to the mtcars example, it now throws the following error:
library(dplyr)
subsample <- mtcars %>% group_by(cyl) %>% sample_vals(10) %>% ungroup()
Error in dplyr:::sample_group(index[[i]], frac = FALSE, tbl = tbl, size = sizes[i], :
  unused argument (tbl = tbl)
Called from: FUN(X[[i]], ...)
Has anyone got a better way of sampling by group, without replacement, up to a maximum size per group? I'm not ordinarily a big user of dplyr, so all options from base R or other packages are also welcome.
Otherwise, does anyone have an idea why the previous work-around has stopped working?
Thanks for everyone's time.
Here's a simple solution using slice:
samples_per_group <- 10
subsample <- mtcars %>%
  group_by(cyl) %>%
  slice(sample(n(), min(samples_per_group, n()))) %>%
  ungroup()
table(subsample$cyl)
# 4 6 8
# 10 7 10
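As a possible alternative, and assuming dplyr 1.0.0 or later is installed, slice_sample() should give the same result without the explicit min(): when sampling without replacement, an n larger than the group is truncated to the group size. This is only a sketch; the seed value is arbitrary and is there just to make the draw repeatable.

```r
library(dplyr)  # assumption: dplyr >= 1.0.0, which provides slice_sample()

set.seed(1)  # arbitrary seed, only for reproducibility

subsample_ss <- mtcars %>%
  group_by(cyl) %>%
  slice_sample(n = 10) %>%  # groups with fewer than 10 rows are kept whole
  ungroup()

# Per-group row counts; no group should exceed 10
table(subsample_ss$cyl)
```

Since replace defaults to FALSE, no row is drawn twice within a group.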
It is quite straightforward with base R as well, for example:
do.call(rbind, lapply(split(mtcars, mtcars$cyl), function(x) {
  n <- nrow(x)
  s <- min(n, 10)
  x[sample(seq_len(n), s), ]
}))
The rows in the output will be sorted by cyl, but row order would probably not matter anyway.
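If this is needed in more than one place, the same idea could be wrapped in a small helper. The sketch below is one way to do it in base R only; the function name sample_up_to and its arguments are made up for illustration, and the seed is arbitrary. It samples row indices rather than splitting the data frame, so the original row names are kept.

```r
# Hypothetical helper (not from any package): keep at most `max_n` rows
# per level of `group_col`, sampled without replacement.
sample_up_to <- function(df, group_col, max_n) {
  # Split the row indices (not the rows themselves) by the grouping column
  idx_by_group <- split(seq_len(nrow(df)), df[[group_col]])
  keep <- unlist(lapply(idx_by_group, function(idx) {
    # Sample positions within idx; calling sample(idx, ...) directly would
    # misbehave for a group whose only row index is a single number > 1
    idx[sample.int(length(idx), min(length(idx), max_n))]
  }), use.names = FALSE)
  df[keep, , drop = FALSE]
}

set.seed(42)  # arbitrary seed, only for reproducibility
subsample_base <- sample_up_to(mtcars, "cyl", 10)
table(subsample_base$cyl)
#  4  6  8
# 10  7 10
```

Note that split() drops rows where the grouping column is NA, so those rows would not appear in the result.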