I am looking to generate a mean and variance value for every row of a numeric tibble. My existing code, which I thought was a fairly idiomatic dplyr solution, takes several hours to complete for 50,000 rows of about 35 columns.
Is there a way to speed up this operation using only dplyr? I know apply and purrr are options, but I am mostly curious whether there is something about dplyr I am overlooking when performing a large series of calculations like this.
Reproducible example:
library(tidyverse)
library(vroom)

# generate 50,000 rows of 40 double columns, then compute per-row stats
gen_tbl(50000, cols = 40,
        col_types = paste0(rep("d", 40), collapse = "")) %>%
  rowwise() %>%
  mutate(mean = mean(c_across(everything())),
         # note: this second c_across() also sees the new `mean` column
         var = var(c_across(everything())))
My suspicion lies with rowwise(), but I am interested in whether there is a more nuanced way to solve this with dplyr, or whether it is just not a problem dplyr is good at.
It seems like the processing time for the rowwise approach explodes quadratically:

[benchmark plot: elapsed time vs. number of rows for the rowwise approach]
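If you want to reproduce the scaling measurement, here is a minimal sketch (time_rowwise is a hypothetical helper, and only the row mean is timed to keep it short). If the scaling really is quadratic, doubling the rows should roughly quadruple the elapsed time:

library(dplyr)
library(tibble)

# Hypothetical helper: build an n_rows x n_cols tibble of doubles and
# time the rowwise mean with system.time().
time_rowwise <- function(n_rows, n_cols = 40) {
  m <- matrix(rnorm(n_rows * n_cols), ncol = n_cols,
              dimnames = list(NULL, paste0("x", seq_len(n_cols))))
  df <- as_tibble(m)
  system.time(
    df %>% rowwise() %>% mutate(mean = mean(c_across(everything())))
  )[["elapsed"]]
}

# elapsed seconds at doubling sizes
sapply(c(2000, 4000, 8000), time_rowwise)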
Pivoting longer makes the calculation about 300x faster. For 50k rows, the code below took 1.2 seconds, compared to 372 seconds for the rowwise approach.
df %>%
  mutate(row = row_number()) %>%      # tag each row with an id
  tidyr::pivot_longer(-row) %>%       # one (row, name, value) triple per cell
  group_by(row) %>%
  summarize(mean = mean(value),       # vectorized, grouped statistics
            var = var(value)) %>%
  bind_cols(df, .)                    # reattach stats to the original columns
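Most of the rowwise cost appears to come from evaluating c_across() once per row, reassembling each row from 40 columns; after pivoting, mean() and var() run over plain grouped vectors. As an aside, stepping outside dplyr for a moment: for these two particular statistics, fully vectorized base R is faster still. A sketch, using the textbook identity var(x) = (sum(x^2) - n * mean(x)^2) / (n - 1), which is less numerically stable than var() but fine for a comparison:

p <- ncol(df)  # number of numeric columns
df %>%
  mutate(mean = rowMeans(df),                           # vectorized row means
         var = (rowSums(df^2) - p * mean^2) / (p - 1))  # vectorized row variances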