
Group by variable in data.table and carry on other variables

Tags: r, data.table

I am working on summaries of financial datasets and I would like to pick rows of the summary by a certain criterion, but without losing the remaining summary values in those rows. Here is a simple example:

library(data.table)

set.seed(1)
tseq <- seq(Sys.time(), length.out = 36, by = "mins")
dt <- data.table(TM_STMP = tseq, COMP = rep(c(rep("A", 4), rep("B", 4), rep("C", 4)), 3), SEC = rep(letters[1:12],3), VOL = rpois(36, 3e+6))
dt2 <- dt[, list(SUM = sum(VOL), MEAN = mean(VOL)), by = list(COMP, SEC)]
dt2
   COMP SEC     SUM    MEAN
1:    A   a 9000329 3000110
2:    A   b 9001274 3000425
3:    A   c 9003505 3001168
4:    A   d 9002138 3000713

Now I would like to get, per COMP, the SEC with the highest total VOL (i.e. the maximal SUM):

dt3 <- dt2[, list(SUM = max(SUM)), by = list(COMP)]
dt3
   COMP     SUM
1:    A 9003505
2:    B 9002888
3:    C 9005042

This gives me what I want, but I would like to keep the other values in the specific rows (SEC and MEAN) such that it looks like this (made by hand):

   COMP     SUM SEC    MEAN
1:    A 9003505   c 3001168
2:    B 9002888   f 3000963  
3:    C 9005042   k 3001681

How can I achieve this?

Simon Z. asked Jan 12 '23

2 Answers

If you are looking for the SEC and the MEAN corresponding to the max of SUM:

dt3 <- dt2[, list(SUM = max(SUM), SEC = SEC[which.max(SUM)], MEAN = MEAN[which.max(SUM)]), by = list(COMP)]
dt3
   COMP     SUM SEC    MEAN
1:    A 9003110   a 3001037
2:    B 9000814   e 2999612
3:    C 9002707   i 2999741

Edit: This'll be faster:

dt2[dt2[, .I[which.max(SUM)], by = list(COMP)]$V1]
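A quick sketch of why this one-liner works (using the `dt2` defined above): inside the grouped call, `.I` holds the original row numbers of `dt2`, so the inner expression returns one row index per `COMP` (the row where `SUM` is maximal), and the outer `dt2[...]` then subsets by those indices in a single step, keeping all other columns for free:

```r
# Assumes dt2 from the question; .I is data.table's vector of row numbers
idx <- dt2[, .I[which.max(SUM)], by = list(COMP)]$V1  # one row index per COMP
dt3 <- dt2[idx]                                       # subset once; SEC and MEAN come along
```

This avoids building each output column separately with `which.max`, which is why it tends to be faster.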
Metrics answered Jan 28 '23


Another way to do this would be to setkey the data.table on COMP, SUM and then use mult="last": since keys sort in ascending order, the last row within each COMP group is the one with the maximal SUM.

setkey(dt2, COMP, SUM)
dt2[J(unique(COMP)), mult="last"]
#    COMP SEC     SUM    MEAN
# 1:    A   c 9002500 3000833
# 2:    B   g 9003312 3001104
# 3:    C   i 9000058 3000019
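A minimal sketch of the mechanics on a toy table (illustration only, not the data from the question): after `setkey`, rows are sorted ascending by `COMP` then `SUM`, so a join with `mult = "last"` returns exactly the maximal-`SUM` row per key:

```r
library(data.table)

x <- data.table(COMP = c("A", "A", "B", "B"), SUM = c(2, 9, 5, 3))
setkey(x, COMP, SUM)               # sorts ascending: within each COMP, the largest SUM is last
x[J(unique(COMP)), mult = "last"]  # one row per COMP: A with SUM 9, B with SUM 5
```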

Edit: In response to Simon's benchmarking of this approach against @Metrics':

set.seed(45)
N <- 1e6
tseq <- seq(Sys.time(), length.out = N, by = "mins")

ff <- function(x) paste(sample(letters, x, TRUE), collapse="")
val1 <- unique(unlist(replicate(1e5, ff(8), simplify=FALSE)))
val2 <- unique(unlist(replicate(1e5, ff(12), simplify=FALSE)))

dt <- data.table(TM_STMP = tseq, COMP = rep(val1, each=100), SEC = rep(val2, each=100), VOL = rpois(1e6, 3e+6))
dt2 <- dt[, list(SUM = sum(VOL), MEAN = mean(VOL)), by = list(COMP, SEC)]

require(microbenchmark)

metrics <- function(x=copy(dt2)) {
    x[, list(SUM = max(SUM),SEC=SEC[which.max(SUM)],MEAN=MEAN[which.max(SUM)]), by = list(COMP)]
}

arun <- function(x=copy(dt2)) {
    setkey(x, COMP, SUM)
    x[J(unique(COMP)), mult="last"]
}

microbenchmark(ans1 <- metrics(dt2), ans2 <- arun(dt2), times=20)
# Unit: milliseconds
#                  expr      min       lq   median       uq       max neval
#  ans1 <- metrics(dt2) 749.0001 804.0651 838.0750 882.3869 1053.3389    20
#     ans2 <- arun(dt2) 301.7696 321.6619 342.4779 359.9343  392.5902    20

setkey(ans1, COMP, SEC)
setkey(ans2, COMP, SEC)
setcolorder(ans1, names(ans2))

identical(ans1, ans2) # [1] TRUE
Arun answered Jan 28 '23