Data table operations with multiple group by variable sets

Tags:

r

data.table

I have a data.table on which I would like to perform group-by operations using several different sets of group-by variables, including the null (empty) set, which yields the grand total.

A toy example:

library(data.table)
set.seed(1)
DT <- data.table(
        id = sample(c("US", "Other"), 25, replace = TRUE), 
        loc = sample(LETTERS[1:5], 25, replace = TRUE), 
        index = runif(25)
        )

I would like to find the sum of index for every combination of the key variables (including the null set). The concept is analogous to "grouping sets" in Oracle SQL. Here is my current workaround:

rbind(
  DT[, list(id = "", loc = "", sindex = sum(index)), by = NULL],
  DT[, list(loc = "", sindex = sum(index)), by = "id"],
  DT[, list(id = "", sindex = sum(index)), by = "loc"],
  DT[, list(sindex = sum(index)), by = c("id", "loc")]
)[order(id, loc)]
       id loc      sindex
 1:           11.54218399
 2:         A  2.82172063
 3:         B  0.98639578
 4:         C  2.89149433
 5:         D  3.93292900
 6:         E  0.90964424
 7: Other      6.19514146
 8: Other   A  1.12107080
 9: Other   B  0.43809711
10: Other   C  2.80724742
11: Other   D  1.58392886
12: Other   E  0.24479728
13:    US      5.34704253
14:    US   A  1.70064983
15:    US   B  0.54829867
16:    US   C  0.08424691
17:    US   D  2.34900015
18:    US   E  0.66484697

Is there a preferred data.table way to accomplish this?


asked Mar 05 '15 23:03

mlegge

1 Answer

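As of this commit, this is possible in the development version of data.table (1.10.5 at the time of writing) using cube() or groupingsets(). In the results, NA in a grouping column marks a variable that was aggregated over: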

library("data.table")
# data.table 1.10.5 IN DEVELOPMENT built 2017-08-08 18:31:51 UTC
#   The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
#   Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
#   Release notes, videos and slides: http://r-datatable.com

cube(DT, list(sindex = sum(index)), by = c("id", "loc"))
#        id loc      sindex
#  1:    US   B  0.54829867
#  2:    US   A  1.70064983
#  3: Other   B  0.43809711
#  4: Other   E  0.24479728
#  5: Other   C  2.80724742
#  6: Other   A  1.12107080
#  7:    US   E  0.66484697
#  8:    US   D  2.34900015
#  9: Other   D  1.58392886
# 10:    US   C  0.08424691
# 11:    NA   B  0.98639578
# 12:    NA   A  2.82172063
# 13:    NA   E  0.90964424
# 14:    NA   C  2.89149433
# 15:    NA   D  3.93292900
# 16:    US  NA  5.34704253
# 17: Other  NA  6.19514146
# 18:    NA  NA 11.54218399
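
To make explicit what cube() computes, here is a minimal sketch that enumerates every subset of the grouping variables by hand and stacks the per-set aggregates. It uses only ordinary grouped queries, so it may also be useful on versions that lack cube(). The names keys, sets, and manual are illustrative and not part of data.table.

```r
library(data.table)
set.seed(1)
DT <- data.table(
  id = sample(c("US", "Other"), 25, replace = TRUE),
  loc = sample(LETTERS[1:5], 25, replace = TRUE),
  index = runif(25)
)

keys <- c("id", "loc")  # illustrative name for the grouping variables

# Every subset of keys, starting with the empty (grand total) set.
sets <- c(
  list(character()),
  unlist(
    lapply(seq_along(keys), function(m) combn(keys, m, simplify = FALSE)),
    recursive = FALSE
  )
)

# Aggregate once per grouping set; the empty set is handled without by.
manual <- rbindlist(
  lapply(sets, function(s) {
    if (length(s)) DT[, list(sindex = sum(index)), by = s]
    else DT[, list(sindex = sum(index))]
  }),
  use.names = TRUE, fill = TRUE  # fill = TRUE leaves NA in absent key columns
)
setcolorder(manual, c(keys, "sindex"))
```

With n grouping variables this produces 2^n grouping sets, which is the same set of aggregates that cube() returns, though possibly in a different row order.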

groupingsets(
  DT,
  j = list(sindex = sum(index)),
  by = c("id", "loc"),
  sets = list(character(), "id", "loc", c("id", "loc"))
)
#        id loc      sindex
#  1:    NA  NA 11.54218399
#  2:    US  NA  5.34704253
#  3: Other  NA  6.19514146
#  4:    NA   B  0.98639578
#  5:    NA   A  2.82172063
#  6:    NA   E  0.90964424
#  7:    NA   C  2.89149433
#  8:    NA   D  3.93292900
#  9:    US   B  0.54829867
# 10:    US   A  1.70064983
# 11: Other   B  0.43809711
# 12: Other   E  0.24479728
# 13: Other   C  2.80724742
# 14: Other   A  1.12107080
# 15:    US   E  0.66484697
# 16:    US   D  2.34900015
# 17: Other   D  1.58392886
# 18:    US   C  0.08424691
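
The question's workaround used empty strings rather than NA and sorted the rows by id and loc. A minimal sketch of getting the same layout from cube(), assuming the grouping columns are character and contain no genuine NA values (res is an illustrative name):

```r
library(data.table)
set.seed(1)
DT <- data.table(
  id = sample(c("US", "Other"), 25, replace = TRUE),
  loc = sample(LETTERS[1:5], 25, replace = TRUE),
  index = runif(25)
)

res <- cube(DT, j = list(sindex = sum(index)), by = c("id", "loc"))

# Replace the NA markers of aggregated-over columns with "" by reference.
# Assumption: the data hold no real NA in id or loc.
for (col in c("id", "loc")) {
  set(res, i = which(is.na(res[[col]])), j = col, value = "")
}

# Sort by reference, matching the [order(id, loc)] step in the question.
setorder(res, id, loc)
```

If the data can contain real NA values in the grouping columns, passing id = TRUE to cube() or groupingsets() adds a grouping column that distinguishes aggregated-over rows from genuine NA groups.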

answered Oct 06 '22 05:10

mlegge
