Consider a data frame dat with a group variable and one or more observations of x per group. Assume there are no ties in x within groups. One way to extract the observation that minimizes x within each group is dplyr::slice_min().
I like that slice_min() clearly expresses my intent, but it's often painfully slow, as shown below. I'd expect arranging the values of x within groups to be slower, since sorting is more work than finding a minimum. So why is the arrange() approach so much faster? Even my strange use of summarize() below is much faster!
More specifically, I'd like to maintain good performance with n groups and O(1) observations per group, as n tends to infinity.
library(dplyr)
library(microbenchmark)
# Simulate data. y is some other variable whose value we'd like to keep at the
# minimum of x.
set.seed(1)
n <- 5e3
k <- 1 + rpois(n, 1)
dat <- data.frame(
  group = rep(1:n, k),
  x = rnorm(sum(k)),
  y = sample(letters, sum(k), replace = TRUE)
)
# Obtain observation that minimizes x within each group
microbenchmark(
  slice = dat |>
    group_by(group) |>
    slice_min(x) |>
    ungroup(),
  arrange = dat |>
    arrange(group, x) |>
    filter(!duplicated(group)),
  summarize = dat |>
    group_by(group) |>
    summarize(i = which.min(x), across(everything(), \(v) v[i])) |>
    select(!i),
  times = 10
)
Performance:
# Unit: milliseconds
#       expr        min         lq       mean    median         uq        max neval
#      slice 556.812802 625.876500 655.172451 632.45395 646.751201 909.931001    10
#    arrange   3.148302   3.209201   3.348941   3.34970   3.441501   3.663301    10
#  summarize  37.503501  37.946201  53.125181  38.17705  38.911001 127.843800    10
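To look at the scaling concern directly (n groups, O(1) observations per group), here is a rough sketch that times two of the approaches at a few values of n. The helpers sim_dat() and time_approaches() are hypothetical names introduced only for this illustration, and single-run system.time() measurements are noisy, so treat the numbers as indicative at best.

```r
library(dplyr)

# Hypothetical helper: simulate n groups with 1 + Poisson(1) rows each,
# mirroring the simulation above
sim_dat <- function(n) {
  k <- 1 + rpois(n, 1)
  data.frame(
    group = rep(seq_len(n), k),
    x = rnorm(sum(k)),
    y = sample(letters, sum(k), replace = TRUE)
  )
}

# Hypothetical helper: elapsed seconds for one run of each approach at a given n
time_approaches <- function(n) {
  dat <- sim_dat(n)
  c(
    n = n,
    slice = system.time(
      dat |> group_by(group) |> slice_min(x) |> ungroup()
    )[["elapsed"]],
    arrange = system.time(
      dat |> arrange(group, x) |> filter(!duplicated(group))
    )[["elapsed"]]
  )
}

set.seed(1)
# One row per value of n; columns are n, slice, arrange
timings <- t(sapply(c(500, 1000, 2000), time_approaches))
timings
```

Comparing how the slice and arrange columns grow as n doubles gives a first impression of how each approach scales; microbenchmark() with repeated runs would be the more careful tool.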
A GitHub issue about the slow performance of slice_max() with large numbers of groups has a dplyr author suggesting a variation on your arrange approach that's even faster:
https://github.com/tidyverse/dplyr/issues/6783
# .keep_all = TRUE keeps x and y; without it, distinct() returns only group
arrange2 = dat |>
  arrange(x) |>
  distinct(group, .keep_all = TRUE)
It sounds like the slowdown comes from inefficient handling of many groups. Timings with arrange2 added to the benchmark:
Unit: milliseconds
      expr        min         lq       mean     median         uq        max neval cld
     slice 518.743380 525.143223 532.999647 529.736452 535.752691 569.156511    10   c
  arrange2   2.049079   2.100594   2.260899   2.171024   2.419913   2.622632    10 a
   arrange   3.354592   3.556705   3.641422   3.629209   3.720807   3.966684    10 a
 summarize  42.778845  43.855632  49.064502  45.582195  47.292923  65.927071    10  b
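Before swapping one approach for another, it may be worth checking that they return the same rows. The sketch below does that on a smaller simulated data set, assuming no ties in x and assuming distinct() is called with .keep_all = TRUE so that x and y are retained. The names res_slice and res_arrange2 are introduced here for illustration only.

```r
library(dplyr)

# Small simulated data set, built the same way as in the question
set.seed(1)
n <- 200
k <- 1 + rpois(n, 1)
dat <- data.frame(
  group = rep(1:n, k),
  x = rnorm(sum(k)),
  y = sample(letters, sum(k), replace = TRUE)
)

# Reference result: slice_min() within each group, converted from a
# tibble back to a plain data frame for comparison
res_slice <- dat |>
  group_by(group) |>
  slice_min(x) |>
  ungroup() |>
  as.data.frame()

# Sort-based result; the final arrange(group) restores group order,
# because arrange(x) leaves the rows sorted by x rather than by group
res_arrange2 <- dat |>
  arrange(x) |>
  distinct(group, .keep_all = TRUE) |>
  arrange(group)

# Compare values only, ignoring row names and other attributes
all.equal(res_slice, res_arrange2, check.attributes = FALSE)
```

If x can contain ties, the two approaches may differ: slice_min() keeps all tied rows by default, while distinct() keeps only the first row per group.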