I would like to compute summary variables from multiple columns in a data frame. This is possible when typing out all the row names, but I would like to use starts_with() and similar functions. I.e.
df <- data.frame(A1 = rnorm(100, 0, 1),
A2 = rnorm(100, 0, 1),
A3 = rnorm(100, 0, 1),
B1 = rnorm(100, 0, 1),
B2 = rnorm(100, 0, 1))
What works:
library(tidyverse)
df %>% mutate(A = (A1 + A2 + A3)/3)
df %>% mutate(A = rowMeans(select(., A1:A3)))
However, the former gets annoying when summarising many variables, while the latter gets incredibly slow very quickly when summarising many rows. I suspect there must be a faster solution.
What does not work:
df %>% mutate(A = mean(A1:A3))
df %>% group_by(row_number()) %>% mutate(A = mean(A1:A3))
df %>% group_by(row_number()) %>% mutate(A = mean(starts_with("A")))
So my question is: Is there a way to use mean() etc. within mutate() to compute row means, ideally without having to spell out every single variable?
Use in your select
function the function starts_with
:
df %>% mutate(A = rowMeans(select(., starts_with("A"))))
If you search the help for select_helpers
you find several more.
Base R is the fastest in my tests.
I will use a bigger dataframe.
library(tidyverse)
library(microbenchmark)
library(ggplot2)
set.seed(1234)
n <- 1e4
df <- data.frame(A1 = rnorm(n, 0, 1),
A2 = rnorm(n, 0, 1),
A3 = rnorm(n, 0, 1),
B1 = rnorm(n, 0, 1),
B2 = rnorm(n, 0, 1))
mb <- microbenchmark(
m1 = df %>% mutate(A = (A1 + A2 + A3)/3),
m2 = df %>% mutate(A = rowMeans(select(., A1:A3))),
m3 = df %>% mutate(A = reduce(.[, grepl("^A", names(.))], `+`) / ncol(.[, grepl("^A", names(.))])),
m4 = rowMeans(df[, grep("^A", names(df))]),
m5 = df[, grep("^A", names(df))] %>% rowMeans()
)
mb
autoplot(mb)
The purely base R way is faster, followed by the base R subsetting/dplyr
pipe.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With