I have a very large data frame (265,874 x 30) with three sensible grouping variables: an age category (1-6), dates (5,479 distinct values), and geographic locality (4 in total). Each record consists of one value for each of these, plus 27 count variables. I want to group by each of the grouping variables, then take colSums() of the 27 count variables within each subgroup. I've been trying to use dplyr (v0.2) for this, because doing it manually means setting up a lot of redundant objects (or resorting to a loop over the grouping combinations, for lack of a more elegant solution).
Example code:
countData <- sample(0:10, 2000, replace = TRUE)
dates <- sample(seq(as.Date("2010/1/1"), as.Date("2010/01/30"), "days"), 200, replace = TRUE)
locality <- sample(1:2, 200, replace = TRUE)  # 200 values, to match the 200 rows of the count matrix
ageCat <- sample(1:2, 200, replace = TRUE)
sampleDF <- data.frame(dates, locality, ageCat, matrix(countData, nrow = 200, ncol = 10))
then what I'd like to do is ...
library("dplyr")
sampleDF %.% group_by(locality, ageCat, dates) %.% do(colSums(.[, -(1:3)]))
but this doesn't quite work, because the result of colSums() isn't a data frame. If I coerce it, it works:
sampleDF %.% group_by(locality, ageCat, dates) %.% do(data.frame(matrix(colSums(.[, -(1:3)]), nrow = 1, ncol = 10)))
but the final do(...) bit seems very clunky.
Any thoughts on how to do this more elegantly or effectively? I guess the question comes down to: how best to use the do() function and the . placeholder to summarize a data frame via colSums().
Note: the do(.) idiom only works in dplyr 0.2, so you need to grab it from GitHub (link) rather than from CRAN.
Edit: results from suggestions
Three solutions:
My suggestion in the post: 146.765 seconds elapsed.
@joran's suggestion below: 6.902 seconds.
@eddi's suggestion in the comments, using data.table: 6.715 seconds (see the sketch after this list).
I didn't bother with replication; I just used system.time() to get a rough gauge. From the looks of it, dplyr and data.table perform approximately the same on my data set, and both, when used properly, are significantly faster than the hack solution I came up with yesterday.
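eddi's comment isn't reproduced here, but the data.table call that was timed presumably had this shape (a minimal sketch, assuming the sample data above):
library(data.table)
sampleDT <- as.data.table(sampleDF)
# .SD holds every column not named in `by`, so lapply(.SD, sum) is the grouped analogue of colSums()
sampleDT[, lapply(.SD, sum), by = list(locality, ageCat, dates)]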
The methods summarize_all and summarize_at mentioned in Hack-R's answer from 2018 have been superseded. Instead, please use summarize()/summarise() combined with across().
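On the sample data from the question, the modern equivalent looks something like this (a sketch; across(everything(), sum) picks up every non-grouping column):
library(dplyr)
# .groups = "drop" simply ungroups the result afterwards
sampleDF %>% group_by(locality, ageCat, dates) %>% summarise(across(everything(), sum), .groups = "drop")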
Unless I'm missing something, this seems like a job for summarise_each (a sort of colwise analogue from plyr):
sampleDF %.% group_by(locality, ageCat, dates) %.% summarise_each(funs(sum))
The grouping columns are not included in the summarizing function by default, and you can apply the functions to just a subset of columns using the same selection technique as in select (see the sketch below).
(summarise_each is in version 0.2 of dplyr but not in 0.1.3, as far as I know.)
The method summarise_each mentioned in joran's answer from 2014 has been deprecated. Instead, please use summarize_all() or summarize_at().
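On the sample data, those replacements would look roughly like this (a sketch; summarize_at() takes a vars() selection, shown here with the X1:X10 columns from the example):
library(dplyr)
# sum every non-grouping column
sampleDF %>% group_by(locality, ageCat, dates) %>% summarize_all(sum)
# or sum only a chosen subset of columns
sampleDF %>% group_by(locality, ageCat, dates) %>% summarize_at(vars(X1:X10), sum)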