Why is slice_min() slow?

Question

Consider a data frame dat with a group variable and one or more observations of x per group. Assume no ties in x within groups. One way to extract the observation that minimizes x within each group is to use dplyr::slice_min().

I like that slice_min() clearly expresses my intent, but it's often painfully slow, as shown below. I'd expect arranging the values of x within groups to be slower, since sorting is more work than finding a minimum. So why is the arrange() approach so much faster? Even my strange use of summarize() below is much faster!

More specifically, I'd like to maintain good performance with n groups and O(1) observations per group, as n tends to infinity.

library(dplyr)
library(microbenchmark)

# Simulate data. y is some other variable whose value we'd like to keep at the
# minimum of x.
set.seed(1)
n <- 5e3
k <- 1 + rpois(n, 1)
dat <- data.frame(
  group = rep(1:n, k), 
  x = rnorm(sum(k)),
  y = sample(letters, sum(k), replace = TRUE)
)

# Obtain observation that minimizes x within each group
microbenchmark(
  slice = dat |> 
    group_by(group) |> 
    slice_min(x) |> 
    ungroup(),
  arrange = dat |> 
    arrange(group, x) |> 
    filter(!duplicated(group)),
  summarize = dat |> 
    group_by(group) |> 
    summarize(i = which.min(x), across(everything(), \(v) v[i])) |> 
    select(!i),
  times = 10
)

Performance:

# Unit: milliseconds
#       expr        min         lq       mean    median         uq        max neval
#      slice 556.812802 625.876500 655.172451 632.45395 646.751201 909.931001    10
#    arrange   3.148302   3.209201   3.348941   3.34970   3.441501   3.663301    10
#  summarize  37.503501  37.946201  53.125181  38.17705  38.911001 127.843800    10

Jon Spring · Accepted Answer

In a GitHub issue about the slow performance of slice_max with large numbers of groups, one of dplyr's authors suggests a variation on your arrange approach that's even faster: https://github.com/tidyverse/dplyr/issues/6783

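  # Sort by x, then keep the first row seen for each group.
  # .keep_all = TRUE retains x and y; without it only group is returned.
  arrange2 = dat |> 
    arrange(x) |>
    distinct(group, .keep_all = TRUE)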
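As a sanity check, here is a minimal sketch comparing the arrange() + distinct() approach with slice_min() on a small data set shaped like the question's dat. The data values are arbitrary, and the sketch assumes you want the other columns kept, so it passes .keep_all = TRUE to distinct(), which otherwise returns only the columns you name.

```r
library(dplyr)

# Small data set in the same shape as the question's dat (values are arbitrary)
set.seed(1)
n <- 200
k <- 1 + rpois(n, 1)
dat <- data.frame(
  group = rep(1:n, k),
  x = rnorm(sum(k)),
  y = sample(letters, sum(k), replace = TRUE)
)

# Reference result from slice_min(), ordered by group
ref <- dat |>
  group_by(group) |>
  slice_min(x) |>
  ungroup() |>
  arrange(group)

# Sort by x, keep the first row per group, then order by group for comparison
fast <- dat |>
  arrange(x) |>
  distinct(group, .keep_all = TRUE) |>
  arrange(group)

# Compare values only, ignoring tibble vs. data.frame attributes
all.equal(as.data.frame(ref), as.data.frame(fast), check.attributes = FALSE)
```

Note that the arrange() + distinct() result comes back ordered by x rather than by group, which is why both results are sorted by group before comparing.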

It sounds like the issue has to do with an inefficient handling of many groups.
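One rough way to probe this yourself is to hold the number of rows fixed and vary only the number of groups. The sketch below does that with a hypothetical helper, time_slice_min() (my own name, not part of dplyr), and system.time(). If per-group overhead dominates, the elapsed time should climb with the group count even though the data size is unchanged. Timings will vary by machine and dplyr version, so treat it as an illustration rather than a benchmark.

```r
library(dplyr)

# Hypothetical helper: time slice_min() on n_rows rows split into n_groups groups
time_slice_min <- function(n_groups, n_rows = 1e4) {
  d <- data.frame(
    group = rep_len(seq_len(n_groups), n_rows),
    x = rnorm(n_rows)
  )
  # system.time() evaluates the expression in this function's environment,
  # so res is available afterwards
  elapsed <- system.time(
    res <- d |> group_by(group) |> slice_min(x) |> ungroup()
  )[["elapsed"]]
  list(n_groups = n_groups, rows_out = nrow(res), seconds = elapsed)
}

set.seed(1)
timings <- lapply(c(10, 100, 1000), time_slice_min)

# One row per setting: group count, rows returned, elapsed seconds
do.call(rbind, lapply(timings, as.data.frame))
```

For a more careful comparison, the same calls could be wrapped in microbenchmark() as in the question.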

Unit: milliseconds
      expr        min         lq       mean     median         uq        max neval cld
     slice 518.743380 525.143223 532.999647 529.736452 535.752691 569.156511    10   c
  arrange2   2.049079   2.100594   2.260899   2.171024   2.419913   2.622632    10 a  
   arrange   3.354592   3.556705   3.641422   3.629209   3.720807   3.966684    10 a  
 summarize  42.778845  43.855632  49.064502  45.582195  47.292923  65.927071    10  b

Tags:

sorting

dataframe

r

dplyr

nahp
