dplyr sample_n by group with unique size argument per group

Tags: r, dplyr

I am trying to draw a stratified sample from a data set for which a variable exists that indicates how large the sample size per group should be.

library(dplyr)
# example data 
df <- data.frame(id = 1:15,
                 grp = rep(1:3,each = 5), 
                 frq = rep(c(3,2,4), each = 5))

In this example, grp is the group I want to sample by and frq is the sample size specified for that group.

Using split, I came up with this possible solution, which gives the desired result but seems rather inefficient:

s <- split(df, df$grp)
lapply(s,function(x) sample_n(x, size = unique(x$frq))) %>% 
      do.call(what = rbind)

Is there a way using just dplyr's group_by and sample_n to do this?

My first thought was:

df %>% group_by(grp) %>% sample_n(size = frq)

but this gives the error:

Error in is_scalar_integerish(size) : object 'frq' not found

asked Jan 28 '23 16:01 by Fred


2 Answers

This works:

df %>% group_by(grp) %>% sample_n(frq[1])

# A tibble: 9 x 3
# Groups:   grp [3]
     id   grp   frq
  <int> <int> <dbl>
1     3     1     3
2     4     1     3
3     2     1     3
4     6     2     2
5     8     2     2
6    13     3     4
7    14     3     4
8    12     3     4
9    11     3     4

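frq[1] takes the first frq value within each group, so sample_n receives a single size per group rather than the whole column. Not sure why it didn't work when you tried it.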
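If sample_n does not evaluate size per group in your dplyr version, the same idea can be sketched with slice and base R's sample. This is only a minimal sketch: it assumes frq is constant within each group and never exceeds the group size, and first(frq) is just another way of writing frq[1].

```r
library(dplyr)

# example data from the question
df <- data.frame(id = 1:15,
                 grp = rep(1:3, each = 5),
                 frq = rep(c(3, 2, 4), each = 5))

set.seed(1)  # arbitrary seed, only for reproducibility

out <- df %>%
  group_by(grp) %>%
  # n() is the number of rows in the current group;
  # sample(n(), k) draws k distinct row positions from 1..n()
  slice(sample(n(), first(frq))) %>%
  ungroup()

out
```

Because slice evaluates its argument once per group, each group contributes exactly its own frq rows.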

answered Feb 03 '23 07:02 by thc


library(tidyverse)

# example data 
df <- data.frame(id = 1:15,
                 grp = rep(1:3,each = 5), 
                 frq = rep(c(3,2,4), each = 5))

set.seed(22)

df %>%
  group_by(grp) %>%   # for each group
  nest() %>%          # nest data
  mutate(v = map(data, ~sample_n(data.frame(id=.$id), unique(.$frq)))) %>%  # sample using id values and (unique) frq value
  unnest(v)           # unnest the sampled values

# # A tibble: 9 x 2
#     grp    id
#   <int> <int>
# 1     1     2
# 2     1     5
# 3     1     3
# 4     2     8
# 5     2     9
# 6     3    14
# 7     3    13
# 8     3    15
# 9     3    11

sample_n works here because, for each group, it receives a data frame of ids (not a vector of ids) and a single frq value.
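To see what each per-group call looks like in isolation, here is a small sketch for group 1 only. The names ids and one_group are made up for illustration; the ids 1:5 and the size 3 are taken from the example data.

```r
library(dplyr)

# ids of group 1, wrapped in a one-column data frame
ids <- data.frame(id = 1:5)

set.seed(22)

# a single size value (the frq of group 1) is passed as the second argument
one_group <- sample_n(ids, 3)

one_group
```

The map call in the pipeline above does the same thing once per nested group, with unique(.$frq) supplying the size.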

An alternative version using map2 and generating the inputs for sample_n in advance:

df %>%
  group_by(grp) %>%                                 # for every group
  summarise(d = list(data.frame(id=id)),            # create a data frame of ids
            frq = unique(frq)) %>%                  # get the unique frq value
  mutate(v = map2(d, frq, ~sample_n(.x, .y))) %>%   # sample using data frame of ids and frq value
  unnest(v) %>%                                     # unnest sampled values
  select(-frq)                                      # remove frq column (if needed)

answered Feb 03 '23 07:02 by AntoniosK