Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Aggregating sub totals and grand totals with data.table

I've got a data.table in R:

library(data.table)
set.seed(1)
DT = data.table(
  group=sample(letters[1:2],100,replace=TRUE), 
  year=sample(2010:2012,100,replace=TRUE),
  v=runif(100))

Aggregating this data into a summary table by group and year is simple and elegant:

table <- DT[,mean(v),by='group, year']

However, aggregating this data into a summary table, including subtotals and grand totals, is a little more difficult, and a lot less elegant:

library(plyr)
yearTot <- DT[,list(mean(v),year='Total'),by='group']
groupTot <- DT[,list(mean(v),group='Total'),by='year']
Tot <- DT[,list(mean(v), year='Total', group='Total')]
table <- rbind.fill(table,yearTot,groupTot,Tot)
table$group[table$group==1] <- 'Total'
table$year[table$year==1] <- 'Total'

This yields:

table[order(table$group, table$year), ]

Is there a simple way to specify subtotals and grand totals with data.table, such as the margins=TRUE command for plyr? I would prefer to use data.table over plyr on my dataset, as it is a very large dataset that I already have in the data.table format.

like image 248
Zach Avatar asked Feb 16 '12 16:02

Zach


People also ask

What is the difference between subtotal and grand total?

A grand total is a total of all calculations on a report, per metric. A subtotal is a total of a specific subset of metric data, totaled at a level you select. Common subtotal functions include sum, count, minimum, maximum, average, mean, and median.

How do you use subtotals in SQL?

In order to calculate a subtotal in SQL query, we can use the ROLLUP extension of the GROUP BY statement. The ROLLUP extension allows us to generate hierarchical subtotal rows according to its input columns and it also adds a grand total row to the result set.


Video Answer


1 Answers

In recent devel data.table you can use new feature called "grouping sets" to produce sub totals:

library(data.table)
set.seed(1)
DT = data.table(
    group=sample(letters[1:2],100,replace=TRUE), 
    year=sample(2010:2012,100,replace=TRUE),
    v=runif(100))

cube(DT, mean(v), by=c("group","year"))
#    group year        V1
# 1:     a 2011 0.4176346
# 2:     b 2010 0.5231845
# 3:     b 2012 0.4306871
# 4:     b 2011 0.4997119
# 5:     a 2012 0.4227796
# 6:     a 2010 0.2926945
# 7:    NA 2011 0.4463616
# 8:    NA 2010 0.4278093
# 9:    NA 2012 0.4271160
#10:     a   NA 0.3901875
#11:     b   NA 0.4835788
#12:    NA   NA 0.4350153
cube(DT, mean(v), by=c("group","year"), id=TRUE)
#    grouping group year        V1
# 1:        0     a 2011 0.4176346
# 2:        0     b 2010 0.5231845
# 3:        0     b 2012 0.4306871
# 4:        0     b 2011 0.4997119
# 5:        0     a 2012 0.4227796
# 6:        0     a 2010 0.2926945
# 7:        2    NA 2011 0.4463616
# 8:        2    NA 2010 0.4278093
# 9:        2    NA 2012 0.4271160
#10:        1     a   NA 0.3901875
#11:        1     b   NA 0.4835788
#12:        3    NA   NA 0.4350153
like image 166
jangorecki Avatar answered Nov 05 '22 12:11

jangorecki