dplyr is amazingly fast, but I wonder if I'm missing something: is it possible to summarise over several variables at once? For example:
library(dplyr)
library(reshape2)

(df=dput(structure(list(sex = structure(c(1L, 1L, 2L, 2L), .Label = c("boy", "girl"), class = "factor"),
    age = c(52L, 58L, 40L, 62L), bmi = c(25L, 23L, 30L, 26L), chol = c(187L, 220L, 190L, 204L)),
    .Names = c("sex", "age", "bmi", "chol"), row.names = c(NA, -4L), class = "data.frame")))
#    sex age bmi chol
# 1  boy  52  25  187
# 2  boy  58  23  220
# 3 girl  40  30  190
# 4 girl  62  26  204

dg=group_by(df,sex)
With this small dataframe, it's easy to write
summarise(dg,mean(age),mean(bmi),mean(chol))
And I know that to get what I want, I could melt, get the means, and then dcast, such as:
dm=melt(df, id.var='sex')
dmg=group_by(dm, sex, variable)
x=summarise(dmg, means=mean(value))
dcast(x, sex~variable)
But what if I have >20 variables and a very large number of rows? Is there anything similar to .SD in data.table that would allow me to take the means of all variables in the grouped data frame? Or is it possible to somehow use lapply on the grouped data frame?
Thanks for any help
summarise() creates a new data frame. It will have one (or more) rows for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input.
As has been mentioned by several folks, mutate_each() and summarise_each() are deprecated in favour of the new across() function.
Answer as of dplyr version 1.0.5:
df %>% group_by(sex) %>% summarise(across(everything(), mean))
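For the ">20 variables" case in the question, across() also accepts a column selection and more than one summary function. The following is only a sketch, assuming dplyr 1.0.0 or later and the toy data from the question; the result names means_numeric and means_and_sds are made up for illustration.

```r
library(dplyr)

# Same toy data as in the question, rebuilt so the snippet runs on its own
df <- data.frame(
  sex  = factor(c("boy", "boy", "girl", "girl")),
  age  = c(52L, 58L, 40L, 62L),
  bmi  = c(25L, 23L, 30L, 26L),
  chol = c(187L, 220L, 190L, 204L)
)

# Mean of every numeric column within each group,
# without listing the columns by name
means_numeric <- df %>%
  group_by(sex) %>%
  summarise(across(where(is.numeric), mean))

# Several summaries per column; .names controls the output column names
means_and_sds <- df %>%
  group_by(sex) %>%
  summarise(across(c(age, bmi, chol),
                   list(mean = mean, sd = sd),
                   .names = "{.col}_{.fn}"))
```

Selecting with where(is.numeric) avoids typing out column names, which is probably what you want once the data frame is wide.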
Original answer:
dplyr now has summarise_each:
df %>% group_by(sex) %>% summarise_each(funs(mean))
The data.table idiom is lapply(.SD, mean), which is
library(data.table)

DT <- data.table(df)
DT[, lapply(.SD, mean), by = sex]
#     sex age bmi  chol
# 1:  boy  55  24 203.5
# 2: girl  51  28 197.0
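If only some of the columns should be averaged, data.table's .SDcols argument restricts what .SD contains. A minimal sketch, assuming the same toy data as in the question; the names num_cols and res are just for illustration.

```r
library(data.table)

# Same toy data as in the question, rebuilt so the snippet runs on its own
df <- data.frame(
  sex  = factor(c("boy", "boy", "girl", "girl")),
  age  = c(52L, 58L, 40L, 62L),
  bmi  = c(25L, 23L, 30L, 26L),
  chol = c(187L, 220L, 190L, 204L)
)
DT <- as.data.table(df)

# Pick the columns to summarise by type; the factor column sex is excluded
num_cols <- names(DT)[sapply(DT, is.numeric)]

# .SDcols limits .SD to the chosen columns before lapply() is applied
res <- DT[, lapply(.SD, mean), by = sex, .SDcols = num_cols]
```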
I'm not sure of a dplyr idiom for the same thing, but you can do something like
dg <- group_by(df, sex)

# the names of the columns you want to summarize
cols <- names(dg)[-1]

# the dots component of your call to summarise
dots <- sapply(cols, function(x) substitute(mean(x), list(x=as.name(x))))

do.call(summarise, c(list(.data=dg), dots))
# Source: local data frame [2 x 4]
#    sex age bmi  chol
# 1  boy  55  24 203.5
# 2 girl  51  28 197.0
Note that there is a GitHub issue, #178, to efficiently implement the plyr idiom colwise in dplyr.