
Data table operations with multiple group by variable sets

Tags: r, data.table

I have a data.table on which I would like to perform group-by operations using several different sets of group-by variables, including the null (empty) set, so that the totals for each set are retained.

A toy example:

library(data.table)
set.seed(1)
DT <- data.table(
        id = sample(c("US", "Other"), 25, replace = TRUE), 
        loc = sample(LETTERS[1:5], 25, replace = TRUE), 
        index = runif(25)
        )

I would like to find the sum of index for every combination of the key variables (including the null set). The concept is analogous to "grouping sets" in Oracle SQL. Here is my current workaround:

rbind(
  DT[, list(id = "", loc = "", sindex = sum(index)), by = NULL],
  DT[, list(loc = "", sindex = sum(index)), by = "id"],
  DT[, list(id = "", sindex = sum(index)), by = "loc"],
  DT[, list(sindex = sum(index)), by = c("id", "loc")]
)[order(id, loc)]
       id loc      sindex
 1:           11.54218399
 2:         A  2.82172063
 3:         B  0.98639578
 4:         C  2.89149433
 5:         D  3.93292900
 6:         E  0.90964424
 7: Other      6.19514146
 8: Other   A  1.12107080
 9: Other   B  0.43809711
10: Other   C  2.80724742
11: Other   D  1.58392886
12: Other   E  0.24479728
13:    US      5.34704253
14:    US   A  1.70064983
15:    US   B  0.54829867
16:    US   C  0.08424691
17:    US   D  2.34900015
18:    US   E  0.66484697
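
The four calls in the workaround differ only in their by argument, so the same idea can be written as a loop over the grouping sets. This is a minimal sketch, not a recommended idiom: the names sets and agg_one are made up for illustration, and, unlike the rbind version above, columns that were aggregated away come back as NA (from fill = TRUE) rather than "".

```r
library(data.table)

# Same toy data as in the question
set.seed(1)
DT <- data.table(
  id = sample(c("US", "Other"), 25, replace = TRUE),
  loc = sample(LETTERS[1:5], 25, replace = TRUE),
  index = runif(25)
)

# Hypothetical helper: aggregate DT over one set of grouping columns
agg_one <- function(s) {
  if (length(s) == 0L) {
    DT[, list(sindex = sum(index))]          # null set: grand total
  } else {
    DT[, list(sindex = sum(index)), by = s]  # s is a character vector
  }
}

# The grouping sets to aggregate over (illustrative name)
sets <- list(character(0), "id", "loc", c("id", "loc"))

# fill = TRUE pads the grouping columns a set does not use with NA
res <- rbindlist(lapply(sets, agg_one), use.names = TRUE, fill = TRUE)
setcolorder(res, c("id", "loc", "sindex"))
res[order(id, loc, na.last = FALSE)]
```

Adding or removing a grouping set then only means editing sets, at the cost of one pass over DT per set.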

Is there a preferred "data table" way to accomplish this?

mlegge asked Mar 05 '15 23:03




1 Answer

As of this commit, this is possible with the development version of data.table (1.10.5 at the time of writing), using either cube or groupingsets:

library("data.table")
# data.table 1.10.5 IN DEVELOPMENT built 2017-08-08 18:31:51 UTC
#   The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
#   Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
#   Release notes, videos and slides: http://r-datatable.com

cube(DT, list(sindex = sum(index)), by = c("id", "loc"))
#        id loc      sindex
#  1:    US   B  0.54829867
#  2:    US   A  1.70064983
#  3: Other   B  0.43809711
#  4: Other   E  0.24479728
#  5: Other   C  2.80724742
#  6: Other   A  1.12107080
#  7:    US   E  0.66484697
#  8:    US   D  2.34900015
#  9: Other   D  1.58392886
# 10:    US   C  0.08424691
# 11:    NA   B  0.98639578
# 12:    NA   A  2.82172063
# 13:    NA   E  0.90964424
# 14:    NA   C  2.89149433
# 15:    NA   D  3.93292900
# 16:    US  NA  5.34704253
# 17: Other  NA  6.19514146
# 18:    NA  NA 11.54218399

groupingsets(
  DT, j = list(sindex = sum(index)), by = c("id", "loc"),
  sets = list(character(), "id", "loc", c("id", "loc"))
)
#        id loc      sindex
#  1:    NA  NA 11.54218399
#  2:    US  NA  5.34704253
#  3: Other  NA  6.19514146
#  4:    NA   B  0.98639578
#  5:    NA   A  2.82172063
#  6:    NA   E  0.90964424
#  7:    NA   C  2.89149433
#  8:    NA   D  3.93292900
#  9:    US   B  0.54829867
# 10:    US   A  1.70064983
# 11: Other   B  0.43809711
# 12: Other   E  0.24479728
# 13: Other   C  2.80724742
# 14: Other   A  1.12107080
# 15:    US   E  0.66484697
# 16:    US   D  2.34900015
# 17: Other   D  1.58392886
# 18:    US   C  0.08424691
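
To get the layout of the workaround in the question (blank strings instead of NA, sorted by id and loc), the cube result can be post-processed. The following is a sketch under the assumption that the installed data.table ships cube, as the development version used above does; the names res and col are illustrative only.

```r
library(data.table)

# Same toy data as in the question
set.seed(1)
DT <- data.table(
  id = sample(c("US", "Other"), 25, replace = TRUE),
  loc = sample(LETTERS[1:5], 25, replace = TRUE),
  index = runif(25)
)

res <- cube(DT, list(sindex = sum(index)), by = c("id", "loc"))

# Replace the NA placeholders in the key columns with "" (by reference)
for (col in c("id", "loc")) {
  set(res, i = which(is.na(res[[col]])), j = col, value = "")
}

# Sort by the key columns; "" sorts before any non-empty string
setorder(res, id, loc)
res
```

This only makes sense when the key columns contain no genuine NA values. If they might, cube and groupingsets accept id = TRUE, which adds a grouping column that tells subtotal rows apart from rows whose key really is NA.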
mlegge answered Oct 06 '22 05:10