For example,
set.seed(2017)
dt <- data.table(
x = sample(LETTERS, size = 10^7, replace = T),
y = sample(LETTERS, size = 10^7, replace = T),
z = sample(10^7, size = 10^7, replace = T)
)
x y z
1: Y U 6830565
2: N V 5063188
3: M L 7551719
---
9999998: C J 8951173
9999999: C T 2265750
10000000: L B 1303897
I would like to group by (x, y) and calculate multiple things regarding z. This works,
system.time(
result1 <- dt[, list(
firstval = first(z),
q1 = quantile(z, 0.5),
q2 = quantile(z, 0.75),
q3 = quantile(z, 0.9)
), keyby=list(x, y)]
)
x y firstval q1 q2 q3
1: A A 4600349 4982000 7469398 8981072
2: A B 9731616 4984859 7493566 9028473
3: A C 3672771 5071190 7496436 8972589
---
674: Z X 1908196 5048164 7521350 9014482
675: Z Y 2933076 4985024 7500346 8977680
676: Z Z 6215175 5018085 7524220 9031993
but is unnecessarily slow because each call to quantile is presumably re-ordering the same data. I can vectorize this as follows
system.time(
result2 <- dt[, list(
firstval = first(z),
quant = c(0.5, 0.75, 0.9),
val = quantile(z, c(0.5, 0.75, 0.9))
), keyby=list(x, y)]
)
x y firstval quant val
1: A A 4600349 0.50 4982000
2: A A 4600349 0.75 7469398
3: A A 4600349 0.90 8981072
---
2026: Z Z 6215175 0.50 5018085
2027: Z Z 6215175 0.75 7524220
2028: Z Z 6215175 0.90 9031993
^^ this is faster but now my data is the wrong shape (and could be significantly, unnecessarily larger in size). I can dcast
the result to get what I want, but this is still memory inefficient.
How can I calculate the quantiles efficiently without transforming the shape of my data? Ideally, I'm looking for something like
result3 <- dt[, c(
list(firstval = first(z)),
c("q1", "q2", "q3") = t(quantile(z, c(0.5, 0.75, 0.9)))
), keyby=list(x, y)]
We could use as.list
instead of calling quantile
multiple times
system.time(
result2 <- dt[, c(list(
firstval = first(z)), as.list(
quantile(z, c(0.5, 0.75, 0.9)))
), keyby=list(x, y)]
)
@akrun already got a good solution, but you could also use code-block inside the data.table
call:
system.time(
result2b <- dt[, {
the_quantiles = quantile(z, c(0.5, 0.75, 0.9))
list(
firstval = first(z),
q1 = the_quantiles[1],
q2 = the_quantiles[2],
q3 = the_quantiles[3]
)
}, keyby=list(x, y)]
)
Notice how the_quantiles
are first computed in one call, and then a code very similar to your first solution is used.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With