The underlying topic of the question is really the point-buy system in D&D 5E, but while working on it, I noticed that c_across() slowed things down enormously.
The approach:
library(dplyr)
library(tidyr)

die_value <- 8L:15L
point_value <- c(0L, 1L, 2L, 3L, 4L, 5L, 7L, 9L)

# look up the point cost of each die value
determine_points <- function(n) {
  sapply(n, function(x) point_value[which(die_value == x)])
}

sum27 <- crossing(
  d1 = die_value,
  d2 = die_value,
  d3 = die_value,
  d4 = die_value,
  d5 = die_value,
  d6 = die_value
) %>%
  mutate(across(d1:d6, ~determine_points(.x), .names = "{col}_points")) %>% # see next blocks for options
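As an aside (not the focus of the question), determine_points can also be written as a single vectorized lookup with match(), avoiding the per-element sapply() call; a minimal sketch:

# vectorized alternative: match() finds the position of each n in die_value
determine_points_vec <- function(n) {
  point_value[match(n, die_value)]
}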
As a first try for summing the d[n]_points columns row-wise into total_points, I took the obvious rowwise %>% sum(c_across()) route:

... %>%
  rowwise() %>%
  mutate(total_points = sum(c_across(d1_points:d6_points)))
However, running this took forever: rowwise() evaluates sum() once per row, here 8^6 = 262,144 times. The more explicit version was really fast in comparison.
mutate(total_points = d1_points + d2_points + d3_points + d4_points + d5_points + d6_points)
Balancing speed against convenience is always a challenge, but here the difference is huge. Is there a better way to sum certain columns of a data frame row-wise, while staying somewhat agnostic about the number of columns to be added?
We could use rowSums instead of sum, with across instead of rowwise + c_across. With rowSums, we can also remove NAs if present:
library(dplyr)

... %>%
  mutate(total_points = rowSums(across(d1_points:d6_points), na.rm = TRUE))
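A small toy check (with hypothetical column names) illustrating the na.rm = TRUE behaviour:

# two hypothetical *_points columns containing NAs
tibble(a_points = c(1L, NA, 3L), b_points = c(4L, 5L, NA)) %>%
  mutate(total_points = rowSums(across(ends_with("_points")), na.rm = TRUE))
#> total_points is 5, 5, 3 -- each NA is simply dropped from its row's sum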
Or, in newer versions (dplyr >= 1.1.0), use pick to select the columns, since we are applying the function rowSums to the whole subset of columns rather than to each column individually. According to ?pick:

With pick(), you typically apply a function to the full data frame.
With across(), you typically apply a function to each column.
... %>%
  mutate(total_points = rowSums(pick(ends_with("_points")), na.rm = TRUE))
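Because the columns are chosen with a tidyselect helper, the same call stays agnostic about the number of point columns, which is what the question asked for. A sketch with a hypothetical extra column:

sum27 %>%
  mutate(d7_points = 0L) %>% # hypothetical extra column
  mutate(total_points = rowSums(pick(ends_with("_points")), na.rm = TRUE))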
If there are no NAs, another option is to reduce with +:
library(purrr)

... %>%
  mutate(total_points = reduce(across(d1_points:d6_points), `+`))
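One small difference between the two approaches (worth verifying in your own session): rowSums always returns a double vector, while reduce with + keeps the integer type of integer columns:

typeof(rowSums(data.frame(a = 1L, b = 2L))) # "double"
typeof(reduce(list(1L, 2L), `+`))           # "integer"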
> dim(sum27)
[1] 262144 12
> system.time(sum27 %>%
+ mutate(total_points = rowSums(across(d1_points:d6_points),
na.rm = TRUE)))
user system elapsed
0.021 0.006 0.027
> # the rowwise version had to be interrupted manually
> system.time(sum27 %>% rowwise %>%
+   mutate(total_points = sum(c_across(d1_points:d6_points)))
+ )
#> Timing stopped at: 122.3 0.892 124.6
> system.time(sum27 %>%
+ mutate(total_points = reduce(across(d1_points:d6_points), `+`)))
user system elapsed
0.030 0.010 0.039
Maybe we're splitting hairs here, but it's slightly faster with data.table.
library(dplyr)
library(data.table)
library(microbenchmark)
library(tidyr)
library(stringr)
die_value <- 8L:15L
point_value <- c(0L, 1L, 2L, 3L, 4L, 5L, 7L, 9L)
determine_points <- function(n) {
  sapply(n, function(x) point_value[which(die_value == x)])
}
dat <- sum27 <- crossing(
d1 = die_value,
d2 = die_value,
d3 = die_value,
d4 = die_value,
d5 = die_value,
d6 = die_value
) %>%
mutate(across(d1:d6, ~determine_points(.x), .names = "{col}_points"))
f_dplyr <- function() {
  z <- sum27 %>%
    mutate(total_points = rowSums(pick(d1_points:d6_points), na.rm = TRUE))
}

setDT(dat)
ptcols <- colnames(sum27)[str_detect(colnames(sum27), "points$")]

f_dt <- function() {
  z <- dat[, total_points := rowSums(.SD), .SDcols = ptcols]
}
microbenchmark(f_dplyr(), f_dt(), times = 500)
#> Warning in microbenchmark(f_dplyr(), f_dt(), times = 500): less accurate
#> nanosecond times to avoid potential integer overflows
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> f_dplyr() 4.488803 4.707517 5.770832 4.944641 6.483268 34.66739 500 a
#> f_dt() 3.091359 3.213519 4.418006 3.349700 5.003578 32.35970 500 b
Created on 2023-02-27 with reprex v2.0.2
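One design note on the data.table version: := adds total_points to dat by reference, so the very first f_dt() call modifies dat, and later benchmark iterations just overwrite the column in place. If you need the original table untouched, work on a copy; a minimal sketch:

dat2 <- copy(dat) # deep copy, so := does not touch dat
dat2[, total_points := rowSums(.SD), .SDcols = ptcols]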