Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R tidy row means from subset of columns [duplicate]

Tags:

r

row

tidyverse

I would like to compute summary variables from multiple columns in a data frame. This is possible when typing out all the row names, but I would like to use starts_with() and similar functions. I.e.

df <- data.frame(A1 = rnorm(100, 0, 1),
                 A2 = rnorm(100, 0, 1),
                 A3 = rnorm(100, 0, 1),
                 B1 = rnorm(100, 0, 1),
                 B2 = rnorm(100, 0, 1))

What works:

library(tidyverse)
df %>% mutate(A = (A1 + A2 + A3)/3)
df %>% mutate(A = rowMeans(select(., A1:A3)))

However, the former gets annoying when summarising many variables, while the latter gets incredibly slow very quickly when summarising many rows. I suspect there must be a faster solution.

What does not work:

df %>% mutate(A = mean(A1:A3))
df %>% group_by(row_number()) %>% mutate(A = mean(A1:A3))
df %>% group_by(row_number()) %>% mutate(A = mean(starts_with("A")))

So my question is: Is there a way to use mean() etc. within mutate() to compute row means, ideally without having to spell out every single variable?

like image 780
simoncolumbus Avatar asked Dec 17 '22 18:12

simoncolumbus


2 Answers

Use in your select function the function starts_with:

df %>% mutate(A = rowMeans(select(., starts_with("A"))))

If you search the help for select_helpers you find several more.

like image 183
ricoderks Avatar answered Jan 07 '23 14:01

ricoderks


Base R is the fastest in my tests.
I will use a bigger dataframe.

library(tidyverse)
library(microbenchmark)
library(ggplot2)

set.seed(1234)

n <- 1e4
df <- data.frame(A1 = rnorm(n, 0, 1),
                 A2 = rnorm(n, 0, 1),
                 A3 = rnorm(n, 0, 1),
                 B1 = rnorm(n, 0, 1),
                 B2 = rnorm(n, 0, 1))

mb <- microbenchmark(
  m1 = df %>% mutate(A = (A1 + A2 + A3)/3),
  m2 = df %>% mutate(A = rowMeans(select(., A1:A3))),
  m3 = df %>% mutate(A = reduce(.[, grepl("^A", names(.))], `+`) / ncol(.[, grepl("^A", names(.))])),
  m4 = rowMeans(df[, grep("^A", names(df))]),
  m5 = df[, grep("^A", names(df))] %>% rowMeans()
)

mb

autoplot(mb)

The purely base R way is faster, followed by the base R subsetting/dplyr pipe.

like image 28
Rui Barradas Avatar answered Jan 07 '23 14:01

Rui Barradas