Roll applying multiple quantiles in data.table to multiple columns

Background:
I can get multiple moments from my data using data.table (see the code appended below), but it is taking a very long time. Since getting a particular percentile already involves sorting each window, I was thinking that the same sorting work could be reused to find several percentiles more efficiently.

A once-through statistic like median is taking 1.79 ms while the non-median quantile is taking 68x longer at 122.8 ms. There has to be a way to reduce the compute time.

Questions:

  • Is there a way to call several quantiles from the same data in a more efficient manner? (One idea is sketched after the example code below.)
  • Can I pull the "lapply" out of the data.table and compose it like I do the name lists?

My example code with tiny synthetic data:

#libraries
library(data.table)      #data.table
library(zoo)             #roll apply
library(microbenchmark)  #timings used further down

#reproducibility
set.seed(45L)

#make data
DT<-data.table(V1=c(1L,2L),
               V2=LETTERS[1:3],
               V3=round(rnorm(300),4),
               V4=round(runif(150),4),
               V5=1:1200)
DT

#get names
my_col_list <- names(DT)[c(3,4)]

#make new variable names
my_name_list1 <- paste0(my_col_list, "_", "33rd_pctile")
my_name_list2 <- paste0(my_col_list, "_", "77th_pctile")

#compute values
for(i in 1:length(my_col_list)){
     #first 
     DT[, (my_name_list1[i]) := unlist(lapply(.SD,
                                          function(x) rollapply(x,
                                                                7,
                                                                quantile,
                                                                fill = NA,
                                                                probs = 1/3)), 
                                   recursive = F),
        .SDcols = my_col_list[i]]
     #second
     DT[, (my_name_list2[i]) := unlist(lapply(.SD,
                                          function(x) rollapply(x,
                                                                7,
                                                                quantile,
                                                                fill = NA,
                                                                probs = 7/9)), 
                                   recursive = F),
        .SDcols = my_col_list[i]]
}

#display it
head(DT,10)
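
One idea of what I mean, just as a sketch (I have not benchmarked whether it actually helps): quantile() accepts a vector of probs, so both percentiles could come from a single rollapply pass per column, and each window would only be sorted once. rollapply returns a matrix when the function returns more than one value, and data.table can assign both result columns at once:

#sketch only: both rolling percentiles in one pass per column
roll_both <- function(x) rollapply(x,
                                   7,
                                   quantile,
                                   fill = NA,
                                   probs = c(1/3, 7/9),
                                   names = FALSE)

for(i in seq_along(my_col_list)){
     DT[, c(my_name_list1[i], my_name_list2[i]) :=
            as.data.table(roll_both(get(my_col_list[i])))]
}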

Microbenchmarking a once-through statistic vs. the many-through statistics says that the quantiles are expensive.

res2 <- microbenchmark(          DT[, (my_name_list1[i]) := unlist(lapply(.SD,
                                                                          function(x) rollapply(x,
                                                                                                7,
                                                                                                mean,
                                                                                                fill = NA)), 
                                                                   recursive = F),
                                    .SDcols = my_col_list[i]],
                                 times = 5)

This says it takes about 1.75 milliseconds for a mean (the median is similar, at 1.79 ms).

> res2
Unit: milliseconds
                                                                                                                                            expr
 DT[, `:=`((my_name_list1[i]), unlist(lapply(.SD, function(x) rollapply(x,      7, mean, fill = NA)), recursive = F)), .SDcols = my_col_list[i]]
      min       lq     mean   median       uq     max neval
 1.465779 1.509114 1.754145 1.618591 1.712103 2.46514     5

but it takes roughly 70x that to compute a quantile

res3 <- microbenchmark(          DT[, (my_name_list1[i]) := unlist(lapply(.SD,
                                                                          function(x) rollapply(x,
                                                                                                7,
                                                                                                quantile,
                                                                                                fill = NA,
                                                                                                probs = 1/3)), 
                                                                   recursive = F),
                                    .SDcols = my_col_list[i]],
                                 times = 5)


> res3
Unit: milliseconds
                                                                                                                                                             expr
 DT[, `:=`((my_name_list1[i]), unlist(lapply(.SD, function(x) rollapply(x,      7, quantile, fill = NA, probs = 1/3)), recursive = F)), .SDcols = my_col_list[i]]
      min       lq     mean   median       uq      max neval
 118.5833 119.2896 122.8432 124.0168 124.4183 127.9082     5

UPDATES:

  • The median from "quantile" takes ~128 ms while "median" takes far less. They are not the same code path.
  • Iterating through the 9 options for "type" in "quantile" gives mean times between 129 ms and 157 ms, so there is no easy win there.
  • The package "WGCNA" requires "GO.db" from Bioconductor, which is not installed by the "install.packages" command. It also requires the packages "impute" and "preprocessCore", which are not installed along with "WGCNA" or "GO.db".
  • Using the (eventually working) WGCNA package reduced the mean time for the rolling quantile to 68.1 ms. That is about half the time, but it is not about 1/70th the time.
  • Using "RollingMedian" from the "RollingWindow" package I get 169.8 microseconds (i.e. 0.1698 milliseconds), which is a LOT faster, but it is not an arbitrary quantile.
  • Using "perccal" seems to drop computing a quantile down to ~145 microseconds. Inside the rollapply this drops compute time to 15.3 milliseconds, which is an 8.5x boost. I'm not sure how much more blood there is to squeeze out of this stone.

Thoughts:

  • The "perccal" approach seems to use only a single core. There may be some "parallel" approach that lets me split the summaries of different variables across different cores (a rough sketch follows this list). This might give some speedup.
  • As more rows are added to the data, the speedup shrinks. Increasing to 9600 rows reduces the speedup from ~8.5x to below 1x, which suggests that the rollapply function itself may also have some overhead issues.
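
Here is the rough, untested sketch of what I mean by the parallel idea (forking via mclapply only works on Unix-alikes; it reuses the objects from the example code above):

#sketch: one core per column for the rolling 33rd percentile
library(parallel)

roll_q <- function(col_name, p) {
  rollapply(DT[[col_name]], 7, quantile, fill = NA, probs = p, names = FALSE)
}

new_cols <- mclapply(my_col_list,
                     function(nm) roll_q(nm, 1/3),
                     mc.cores = min(length(my_col_list), detectCores()))

DT[, (my_name_list1) := new_cols]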
asked Jul 24 '17 by EngrStudent

1 Answer

Data table optimization

You are correct that the median is especially fast in this case. That is because it runs specialized C code, unlike the quantile function, which is pure R code.

We can read about this optimization in the data.table documentation:

?datatable.optimize

There we have:

When expressions in j which contains only these functions min, max, mean, median, var, sd, prod, for example, dt[, list(mean(x), median(x), min(y), max(y)), by=z], they are very effectively optimised using, what we call, GForce. These functions are replaced with gmean, gmedian, gmin, gmax instead

And they give an example for the speed improvement for the median case:

# Generate a big data.table with relatively many columns
set.seed(1L)
dt = lapply(1:20, function(x) sample(c(-100:100), 5e6L, TRUE))
setDT(dt)[, id := sample(1e5, 5e6, TRUE)]
print(object.size(dt), units="Mb") # 400MB, not huge, but will do

# GForce
options(datatable.optimize = 2L) # optimisation 'on'
system.time(ans1 <- dt[, lapply(.SD, median), by=id])
system.time(ans2 <- dt[, lapply(.SD, function(x) as.numeric(stats::median(x))), by=id])
identical(ans1, ans2)

On my system the R internal version is about 44x slower than the data.table version.
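
As a side note (my addition, not part of the documentation example): the verbose argument of data.table reports whether GForce kicked in for a given call, which is a quick way to check that a j expression stays on the optimized path:

#check that the optimized GForce path (gmedian) is being used
dt[, lapply(.SD, median), by = id, verbose = TRUE]
#the verbose output should mention GForce, e.g. "GForce optimized j to ..."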

Speeding up quantile

We can still try to improve the speed of the quantile function in R. My approach here is basically "Use the source, Luke" and look at the quantile function itself. At the top level we get the standard generic:

>> quantile
function (x, ...) 
UseMethod("quantile")
<bytecode: 0x0000000009154c78>
<environment: namespace:stats>

We can trace this a bit more:

>> methods(quantile)
[1] quantile.default* quantile.ecdf*    quantile.POSIXt*  quantile.zoo     
see '?methods' for accessing help and source code

and have a look at the default function.

>> stats:::quantile.default
function (x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE, 
    type = 7, ...) 
{
...
}

Now we have the entire source, which is quite long, and we can compare it to the R median source in median.default. We can copy the source into a user-defined function and profile it (with a small change: supplying the namespace for format_perc). From this it is clear that only two parts matter, namely the sorting and the output formatting. The sorting is very similar to the median function and likely hard to improve, but the formatting can be skipped entirely by commenting it out.

fast.quant <- function (x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE, 
                        type = 7, ...) 
{
  if (is.factor(x)) {
    ...

    ...
    if (names && np > 0L) {
      #names(qs) <- stats:::format_perc(probs)
    }
    ...
}
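
As a rough usage sketch (assuming the elided body above is filled in from stats:::quantile.default), the modified function simply slots into the rolling call from the question:

#sketch: swap fast.quant into the original rolling computation
for(i in seq_along(my_col_list)){
     DT[, (my_name_list1[i]) := rollapply(get(my_col_list[i]),
                                          7,
                                          fast.quant,
                                          fill = NA,
                                          probs = 1/3)]
}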

All in all, this fix cuts the run time in half. It is still not the optimized median, but most likely it is hard to get better performance without leaving R.

It is possible, and maybe even probable, that the optimizations in data.table could be leveraged to help with the quantile calculations as well, since data.table also implements its sorting in C. One would, however, still want to exploit the fact that only partial sorting is required. Otherwise the Rcpp package can be used to perform a similar optimization, as in the sketch below.
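
For completeness, here is a minimal Rcpp sketch of that idea (my own addition, not a measured solution): a type-7 quantile that relies on std::nth_element, so each call only partially sorts the data:

library(Rcpp)

cppFunction(includes = "#include <algorithm>", code = '
double quantile7(NumericVector x, double p) {
  int n = x.size();
  NumericVector y = clone(x);      // work on a copy, leave the input untouched
  double h = (n - 1) * p;          // type-7 interpolation position
  int lo = (int) std::floor(h);
  int hi = (int) std::ceil(h);
  std::nth_element(y.begin(), y.begin() + lo, y.end());
  double xlo = y[lo];
  std::nth_element(y.begin() + lo, y.begin() + hi, y.end());
  double xhi = y[hi];
  return xlo + (h - lo) * (xhi - xlo);
}')

#sanity check against the base implementation
x <- rnorm(1e3)
all.equal(quantile7(x, 1/3), unname(quantile(x, probs = 1/3, type = 7)))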

answered Nov 15 '22 by Fredrik Isaksson