Consider a data frame dat with a group variable and one or more observations of x per group. Assume no ties in x within groups. One way to extract the observation that minimizes x within each group is to use dplyr::slice_min().
I like that slice_min() clearly expresses my intent, but it's often painfully slow, as below. I'd expect arranging the values of x within groups to be slower, since sorting is more complex than finding a minimum. So why is the arrange() approach so much faster? Even my strange use of summarize() below is much faster!
More specifically, I'd like to maintain good performance with n groups and O(1) observations per group, as n tends to infinity.
library(dplyr)
library(microbenchmark)
# Simulate data. y is some other variable whose value we'd like to keep at the
# minimum of x.
set.seed(1)
n <- 5e3
k <- 1 + rpois(n, 1)
dat <- data.frame(
  group = rep(1:n, k), 
  x = rnorm(sum(k)),
  y = sample(letters, sum(k), replace = TRUE)
)
# Obtain observation that minimizes x within each group
microbenchmark(
  slice = dat |> 
    group_by(group) |> 
    slice_min(x) |> 
    ungroup(),
  arrange = dat |> 
    arrange(group, x) |> 
    filter(!duplicated(group)),
  summarize = dat |> 
    group_by(group) |> 
    summarize(i = which.min(x), across(everything(), \(v) v[i])) |> 
    select(!i),
  times = 10
)
Performance:
# Unit: milliseconds
#       expr        min         lq       mean    median         uq        max neval
#      slice 556.812802 625.876500 655.172451 632.45395 646.751201 909.931001    10
#    arrange   3.148302   3.209201   3.348941   3.34970   3.441501   3.663301    10
#  summarize  37.503501  37.946201  53.125181  38.17705  38.911001 127.843800    10
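To probe the stated goal (n groups, O(1) observations per group, growing n), here is a minimal sketch that times only the arrange approach at a few sizes. The helpers simulate() and min_by_group() and the chosen sizes are hypothetical names and values introduced for illustration; system.time() is coarse, so treat the numbers as a rough indication rather than a benchmark.

```r
library(dplyr)

# Hypothetical helper: n groups, each with 1 + Poisson(1) observations
simulate <- function(n) {
  k <- 1 + rpois(n, 1)
  data.frame(
    group = rep(seq_len(n), k),
    x = rnorm(sum(k)),
    y = sample(letters, sum(k), replace = TRUE)
  )
}

# Hypothetical helper wrapping the arrange approach from the question
min_by_group <- function(dat) {
  dat |>
    arrange(group, x) |>
    filter(!duplicated(group))
}

set.seed(1)
sizes <- c(1e3, 1e4, 1e5)  # assumed sizes, adjust as needed

# Elapsed seconds for a single run at each size
elapsed <- sapply(sizes, function(n) {
  dat <- simulate(n)
  system.time(min_by_group(dat))[["elapsed"]]
})

data.frame(n = sizes, elapsed_sec = elapsed)
```

Swapping the body of min_by_group() for the slice_min() or summarize() pipeline should show how each one scales with the number of groups.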
A GitHub issue about the slow performance of slice_max with large numbers of groups has one of dplyr's authors suggesting a variation on your arrange approach that's even faster:
https://github.com/tidyverse/dplyr/issues/6783
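  # As written, distinct(group) returns only the group column;
  # pass .keep_all = TRUE to distinct() to keep x and y as well.
  arrange2 = dat |> 
    arrange(x) |>
    distinct(group)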
It sounds like the issue has to do with an inefficient handling of many groups.
Unit: milliseconds
      expr        min         lq       mean     median         uq        max neval cld
     slice 518.743380 525.143223 532.999647 529.736452 535.752691 569.156511    10   c
  arrange2   2.049079   2.100594   2.260899   2.171024   2.419913   2.622632    10 a  
   arrange   3.354592   3.556705   3.641422   3.629209   3.720807   3.966684    10 a  
 summarize  42.778845  43.855632  49.064502  45.582195  47.292923  65.927071    10  b 
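One caveat worth checking: distinct(group) on its own keeps only the group column. A minimal sketch, assuming the same simulated data as in the question, that adds .keep_all = TRUE and confirms the result matches the original arrange approach (the names ref and alt are introduced here for illustration):

```r
library(dplyr)

set.seed(1)
n <- 5e3
k <- 1 + rpois(n, 1)
dat <- data.frame(
  group = rep(1:n, k),
  x = rnorm(sum(k)),
  y = sample(letters, sum(k), replace = TRUE)
)

# Reference: the arrange approach from the question
ref <- dat |>
  arrange(group, x) |>
  filter(!duplicated(group))

# Variation with .keep_all = TRUE so that x and y are retained.
# The trailing arrange(group) only makes the row order comparable.
alt <- dat |>
  arrange(x) |>
  distinct(group, .keep_all = TRUE) |>
  arrange(group)

# Both approaches pick the same row in every group
stopifnot(
  identical(ref$group, alt$group),
  identical(ref$x, alt$x),
  identical(ref$y, alt$y)
)

# Without .keep_all, only the grouping column survives
names(dat |> arrange(x) |> distinct(group))
```

Note that the timings above were not re-measured with .keep_all = TRUE, so the figure for arrange2 may differ once the extra columns are carried along.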