Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to efficiently calculate multiple quantiles of column z when grouping by columns (x, y)

Tags:

r

data.table

For example,

set.seed(2017)
dt <- data.table(
  x = sample(LETTERS, size = 10^7, replace = T),
  y = sample(LETTERS, size = 10^7, replace = T),
  z = sample(10^7, size = 10^7, replace = T)
)

          x y       z
       1: Y U 6830565
       2: N V 5063188
       3: M L 7551719
      ---            
 9999998: C J 8951173
 9999999: C T 2265750
10000000: L B 1303897

I would like to group by (x, y) and calculate multiple things regarding z. This works,

system.time(
  result1 <- dt[, list(
    firstval = first(z), 
    q1 = quantile(z, 0.5), 
    q2 = quantile(z, 0.75),
    q3 = quantile(z, 0.9)
    ), keyby=list(x, y)]
)

     x y firstval      q1      q2      q3
  1: A A  4600349 4982000 7469398 8981072
  2: A B  9731616 4984859 7493566 9028473
  3: A C  3672771 5071190 7496436 8972589
 ---                                     
674: Z X  1908196 5048164 7521350 9014482
675: Z Y  2933076 4985024 7500346 8977680
676: Z Z  6215175 5018085 7524220 9031993

but is unnecessarily slow because each call to quantile is presumably re-ordering the same data. I can vectorize this as follows

system.time(
  result2 <- dt[, list(
    firstval = first(z), 
    quant = c(0.5, 0.75, 0.9), 
    val = quantile(z, c(0.5, 0.75, 0.9))
  ), keyby=list(x, y)]
)

      x y firstval quant     val
   1: A A  4600349  0.50 4982000
   2: A A  4600349  0.75 7469398
   3: A A  4600349  0.90 8981072
  ---                           
2026: Z Z  6215175  0.50 5018085
2027: Z Z  6215175  0.75 7524220
2028: Z Z  6215175  0.90 9031993

^^ this is faster but now my data is the wrong shape (and could be significantly, unnecessarily larger in size). I can dcast the result to get what I want, but this is still memory inefficient.

How can I calculate the quantiles efficiently without transforming the shape of my data? Ideally, I'm looking for something like

result3 <- dt[, c(
  list(firstval = first(z)), 
  c("q1", "q2", "q3") = t(quantile(z, c(0.5, 0.75, 0.9)))
), keyby=list(x, y)]
like image 337
Ben Avatar asked Dec 07 '17 19:12

Ben


2 Answers

We could use as.list instead of calling quantile multiple times

system.time(
 result2 <- dt[, c(list(
  firstval = first(z)), as.list( 
  quantile(z, c(0.5, 0.75, 0.9)))
 ), keyby=list(x, y)]
)
like image 103
akrun Avatar answered Oct 19 '22 23:10

akrun


@akrun already got a good solution, but you could also use code-block inside the data.table call:

system.time(
  result2b <- dt[, {
      the_quantiles = quantile(z, c(0.5, 0.75, 0.9))
      list(
        firstval = first(z), 
        q1 = the_quantiles[1], 
        q2 = the_quantiles[2],
        q3 = the_quantiles[3]
      )
    }, keyby=list(x, y)]
)

Notice how the_quantiles are first computed in one call, and then a code very similar to your first solution is used.

like image 29
Jealie Avatar answered Oct 19 '22 22:10

Jealie