Here is the data:
df <-
  data.frame(group = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4),
             value = LETTERS[1:20])
I need to randomly select a sequence of four consecutive values from each group with dplyr. The selected values should be in the same order as in the data, with no gaps between them.
The desired result may look like this:
   group value
1      1     A
2      1     B
3      1     C
4      1     D
6      2     F
7      2     G
8      2     H
9      2     I
11     3     K
12     3     L
13     3     M
14     3     N
17     4     Q
18     4     R
19     4     S
20     4     T
or like this:
   group value
1      1     A
2      1     B
3      1     C
4      1     D
5      2     E
6      2     F
7      2     G
8      2     H
10     3     J
11     3     K
12     3     L
13     3     M
17     4     Q
18     4     R
19     4     S
20     4     T
This is where I am in solving this:
library(dplyr)
set.seed(23)
df %>%
  group_by(group) %>%
  # flags each row independently, so the selected rows are not contiguous
  mutate(selected = sample(0:1, size = n(), replace = TRUE)) %>%
  filter(selected == 1)
However, I couldn't figure out how to generate exactly 4 ones in a row, with zeroes before or after them.
For each group, we can sample a single starting row from 1 to n() - 3 (the number of rows minus three), then add 0:3 to it to get the four consecutive row numbers we retain.
library(dplyr)
set.seed(42)
df %>%
  group_by(group) %>%
  # pick one start row per group, then keep that row and the next three
  filter(row_number() %in% c(sample(max(1, n() - 3), size = 1) + 0:3)) %>%
  ungroup()
# # A tibble: 16 × 2
#    group value
#    <dbl> <chr>
#  1     1 A
#  2     1 B
#  3     1 C
#  4     1 D
#  5     2 E
#  6     2 F
#  7     2 G
#  8     2 H
#  9     3 J
# 10     3 K
# 11     3 L
# 12     3 M
# 13     4 Q
# 14     4 R
# 15     4 S
# 16     4 T
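As a quick sanity check of the approach above, the sketch below keeps the original row index and confirms that each group contributes exactly four rows with no gaps. The `row` column and the `res` object are hypothetical helpers added for illustration only; they are not part of the answer itself.

```r
library(dplyr)

df <- data.frame(group = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4),
                 value = LETTERS[1:20])

set.seed(42)
res <- df %>%
  mutate(row = row_number()) %>%   # hypothetical helper: original row index
  group_by(group) %>%
  filter(row_number() %in% c(sample(max(1, n() - 3), size = 1) + 0:3)) %>%
  summarise(
    n = n(),                              # number of rows kept per group
    contiguous = all(diff(row) == 1)      # TRUE when the kept rows are adjacent
  )

# Every group should report n = 4 and contiguous = TRUE, whatever the seed
res
```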
Safety steps here:
- max(1, n()-3) makes sure that we don't attempt to sample negative (or zero) row numbers.
- row_number() %in% ... will never try to index rows that don't exist, even if c(sample(.) + 0:3) might suggest more rows than exist.
You can also try embed (though it is not as efficient as the answer by @r2evans):
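# Needs dplyr >= 1.1.0 for `.by`. Assumes every group has at least 4 rows
# and that values are unique within a group (rows are matched by value).
df %>%
  filter(
    value %in% embed(value, 4)[sample.int(n() - 3, 1), ],
    .by = group
  )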
or
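library(tidyr)  # for unnest()
df %>%
  summarise(
    # embed() returns each window reversed, so 4:1 restores the data order
    value = list(embed(value, 4)[sample.int(n() - 3, 1), 4:1]),
    .by = group
  ) %>%
  unnest(value)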
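To make the `4:1` indexing less mysterious, here is a small base R sketch of what `embed()` returns for the five values of group 2. The names `x` and `w` are illustrative only.

```r
x <- LETTERS[5:9]   # the values of group 2: "E" "F" "G" "H" "I"
w <- embed(x, 4)    # one row per sliding window of length 4

w
#      [,1] [,2] [,3] [,4]
# [1,] "H"  "G"  "F"  "E"
# [2,] "I"  "H"  "G"  "F"

# Each row holds a window in reverse order, so reversing the columns
# gives the values back in their original order
first_window <- w[1, 4:1]
first_window
# [1] "E" "F" "G" "H"
```

A vector of length n yields n - 3 windows, which is why both variants sample a row index from `n() - 3`. Note that `embed(x, 4)` stops with an error when `x` has fewer than four elements, so these variants do not handle short groups.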