I am working on some summaries for financial datasets and I would like to sort the summary in regard to a certain criterion, but without loosing the remaining summary values in a row. Here is a simple example:
set.seed(1)
tseq <- seq(Sys.time(), length.out = 36, by = "mins")
dt <- data.table(TM_STMP = tseq, COMP = rep(c(rep("A", 4), rep("B", 4), rep("C", 4)), 3), SEC = rep(letters[1:12],3), VOL = rpois(36, 3e+6))
dt2 <- dt[, list(SUM = sum(VOL), MEAN = mean(VOL)), by = list(COMP, SEC)]
dt2
COMP SEC SUM MEAN
1: A a 9000329 3000110
2: A b 9001274 3000425
3: A c 9003505 3001168
4: A d 9002138 3000713
Now I would like to get the SEC per COMP with highest VOL:
dt3 <- dt2[, list(SUM = max(SUM)), by = list(COMP)]
dt3
COMP SUM
1: A 9003505
2: B 9002888
3: C 9005042
This gives me what I want, but I would like to keep the other values in the specific rows (SEC and MEAN) such that it looks like this (made by hand):
COMP SUM SEC MEAN
1: A 9003505 c 3001168
2: B 9002888 f 3000963
3: C 9005042 k 3001681
How can I achieve this?
If you are looking for the SEC and the MEAN corresponding to max of SUM:
dt3 <- dt2[, list(SUM = max(SUM),SEC=SEC[which.max(SUM)],MEAN=MEAN[which.max(SUM)]), by = list(COMP)]
> dt3
COMP SUM SEC MEAN
1: A 9003110 a 3001037
2: B 9000814 e 2999612
3: C 9002707 i 2999741
Edit: This'll be faster:
dt2[dt2[, .I[which.max(SUM)], by = list(COMP)]$V1]
Another way to do this would be to setkey
of the data.table to: COMP, SUM
and then use mult="last"
as follows:
setkey(dt2, COMP, SUM)
dt2[J(unique(COMP)), mult="last"]
# COMP SEC SUM MEAN
# 1: A c 9002500 3000833
# 2: B g 9003312 3001104
# 3: C i 9000058 3000019
Edit: To answer to Simon's benchmarking about speed differences between this and @metrics':
set.seed(45)
N <- 1e6
tseq <- seq(Sys.time(), length.out = N, by = "mins")
ff <- function(x) paste(sample(letters, x, TRUE), collapse="")
val1 <- unique(unlist(replicate(1e5, ff(8), simplify=FALSE)))
val2 <- unique(unlist(replicate(1e5, ff(12), simplify=FALSE)))
dt <- data.table(TM_STMP = tseq, COMP = rep(val1, each=100), SEC = rep(val2, each=100), VOL = rpois(1e6, 3e+6))
dt2 <- dt[, list(SUM = sum(VOL), MEAN = mean(VOL)), by = list(COMP, SEC)]
require(microbenchmark)
metrics <- function(x=copy(dt2)) {
x[, list(SUM = max(SUM),SEC=SEC[which.max(SUM)],MEAN=MEAN[which.max(SUM)]), by = list(COMP)]
}
arun <- function(x=copy(dt2)) {
setkey(x, COMP, SUM)
x[J(unique(COMP)), mult="last"]
}
microbenchmark(ans1 <- metrics(dt2), ans2 <- arun(dt2), times=20)
# Unit: milliseconds
# expr min lq median uq max neval
# ans1 <- metrics(dt2) 749.0001 804.0651 838.0750 882.3869 1053.3389 20
# ans2 <- arun(dt2) 301.7696 321.6619 342.4779 359.9343 392.5902 20
setkey(ans1, COMP, SEC)
setkey(ans2, COMP, SEC)
setcolorder(ans1, names(ans2))
identical(ans1, ans2) # [1] TRUE
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With