The underlying topic of the question is really the point-buy system in D&D 5E, but while working on it, I noticed that c_across() slowed things down enormously.
The approach:
library(dplyr)
library(tidyr)

die_value <- 8L:15L
point_value <- c(0L, 1L, 2L, 3L, 4L, 5L, 7L, 9L)

# look up the point cost of each die value
determine_points <- function(n) {
  sapply(n, function(x) point_value[which(die_value == x)])
}

sum27 <- crossing(
  d1 = die_value,
  d2 = die_value,
  d3 = die_value,
  d4 = die_value,
  d5 = die_value,
  d6 = die_value
) %>%
  mutate(across(d1:d6, ~determine_points(.x), .names = "{col}_points")) %>% # see next blocks for options
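As an aside (not the focus of the question), determine_points can also be written as a single vectorized lookup with match(), avoiding the per-element sapply() call; a minimal sketch:

# vectorized alternative: match() finds the position of each n in die_value
determine_points_vec <- function(n) {
  point_value[match(n, die_value)]
}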
As a first try for summing the d[n]_points columns row-wise into total_points, I took the obvious rowwise %>% sum(c_across()) route:

... %>%
  rowwise() %>%
  mutate(total_points = sum(c_across(d1_points:d6_points)))
However, running this took forever: rowwise() evaluates sum() once per row, here 8^6 = 262,144 times. The more explicit version was really fast in comparison.
mutate(total_points = d1_points + d2_points + d3_points + d4_points + d5_points + d6_points)
Balancing speed against convenience is always a challenge, but here the difference is huge. Is there a better way to sum certain columns of a data frame row-wise, while staying somewhat agnostic about the number of columns to be added?
We could use rowSums instead of sum, with across instead of rowwise + c_across. With rowSums, we can also remove NAs if present:
library(dplyr)

... %>%
  mutate(total_points = rowSums(across(d1_points:d6_points), na.rm = TRUE))
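A small toy check (with hypothetical column names) illustrating the na.rm = TRUE behaviour:

# two hypothetical *_points columns containing NAs
tibble(a_points = c(1L, NA, 3L), b_points = c(4L, 5L, NA)) %>%
  mutate(total_points = rowSums(across(ends_with("_points")), na.rm = TRUE))
#> total_points is 5, 5, 3 -- each NA is simply dropped from its row's sum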
Or, in newer versions (dplyr >= 1.1.0), use pick to select the columns, since we are applying the function rowSums to the whole subset of columns rather than to each column individually. According to ?pick:

With pick(), you typically apply a function to the full data frame.
With across(), you typically apply a function to each column.
... %>%
  mutate(total_points = rowSums(pick(ends_with("_points")), na.rm = TRUE))
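Because the columns are chosen with a tidyselect helper, the same call stays agnostic about the number of point columns, which is what the question asked for. A sketch with a hypothetical extra column:

sum27 %>%
  mutate(d7_points = 0L) %>% # hypothetical extra column
  mutate(total_points = rowSums(pick(ends_with("_points")), na.rm = TRUE))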
If there are no NAs, another option is to reduce with +:
library(purrr)

... %>%
  mutate(total_points = reduce(across(d1_points:d6_points), `+`))
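One small difference between the two approaches (worth verifying in your own session): rowSums always returns a double vector, while reduce with + keeps the integer type of integer columns:

typeof(rowSums(data.frame(a = 1L, b = 2L))) # "double"
typeof(reduce(list(1L, 2L), `+`))           # "integer"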
> dim(sum27)
[1] 262144 12
> system.time(sum27 %>%
+ mutate(total_points = rowSums(across(d1_points:d6_points),
na.rm = TRUE)))
user system elapsed
0.021 0.006 0.027
> # the rowwise version had to be interrupted manually
> system.time(sum27 %>% rowwise %>%
+   mutate(total_points = sum(c_across(d1_points:d6_points)))
+ )
#> Timing stopped at: 122.3 0.892 124.6
> system.time(sum27 %>%
+ mutate(total_points = reduce(across(d1_points:d6_points), `+`)))
user system elapsed
0.030 0.010 0.039
Maybe we're splitting hairs here, but it's slightly faster with data.table.
library(dplyr)
library(data.table)
library(microbenchmark)
library(tidyr)
library(stringr)
die_value <- 8L:15L
point_value <- c(0L, 1L, 2L, 3L, 4L, 5L, 7L, 9L)
determine_points <- function(n) {
  sapply(n, function(x) point_value[which(die_value == x)])
}
dat <- sum27 <- crossing(
d1 = die_value,
d2 = die_value,
d3 = die_value,
d4 = die_value,
d5 = die_value,
d6 = die_value
) %>%
mutate(across(d1:d6, ~determine_points(.x), .names = "{col}_points"))
f_dplyr <- function() {
  z <- sum27 %>%
    mutate(total_points = rowSums(pick(d1_points:d6_points), na.rm = TRUE))
}

setDT(dat)
ptcols <- colnames(sum27)[str_detect(colnames(sum27), "points$")]

f_dt <- function() {
  z <- dat[, total_points := rowSums(.SD), .SDcols = ptcols]
}
microbenchmark(f_dplyr(), f_dt(), times = 500)
#> Warning in microbenchmark(f_dplyr(), f_dt(), times = 500): less accurate
#> nanosecond times to avoid potential integer overflows
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> f_dplyr() 4.488803 4.707517 5.770832 4.944641 6.483268 34.66739 500 a
#> f_dt() 3.091359 3.213519 4.418006 3.349700 5.003578 32.35970 500 b
Created on 2023-02-27 with reprex v2.0.2
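One design note on the data.table version: := adds total_points to dat by reference, so the very first f_dt() call modifies dat, and later benchmark iterations just overwrite the column in place. If you need the original table untouched, work on a copy; a minimal sketch:

dat2 <- copy(dat) # deep copy, so := does not touch dat
dat2[, total_points := rowSums(.SD), .SDcols = ptcols]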