Background:
I can get multiple moments from my data using data.table (see the code appended below), but it is taking a very long time. I was thinking that the sorting required to find one percentile could be reused to find several at once, which should be more efficient.
A once-through statistic like the median takes about 1.79 ms, while a non-median quantile takes roughly 68x longer at 122.8 ms. There has to be a way to reduce the compute time.
Questions:
Is there a way to reduce the compute time of the rolling quantiles, ideally by getting several percentiles out of a single pass?
My example code with tiny synthetic data:
# libraries
library(data.table)      # data.table
library(zoo)             # rollapply
library(microbenchmark)  # timings shown below
# reproducibility
set.seed(45L)
# make data
DT <- data.table(V1 = c(1L, 2L),
                 V2 = LETTERS[1:3],
                 V3 = round(rnorm(300), 4),
                 V4 = round(runif(150), 4),
                 V5 = 1:1200)
DT
# get names
my_col_list <- names(DT)[c(3, 4)]
# make new variable names
my_name_list1 <- paste0(my_col_list, "_", "33rd_pctile")
my_name_list2 <- paste0(my_col_list, "_", "77th_pctile")
# compute values
for (i in seq_along(my_col_list)) {
  # first quantile (prob = 1/3)
  DT[, (my_name_list1[i]) := unlist(lapply(.SD, function(x)
           rollapply(x, 7, quantile, fill = NA, probs = 1/3)),
         recursive = FALSE),
     .SDcols = my_col_list[i]]
  # second quantile (prob = 7/9)
  DT[, (my_name_list2[i]) := unlist(lapply(.SD, function(x)
           rollapply(x, 7, quantile, fill = NA, probs = 7/9)),
         recursive = FALSE),
     .SDcols = my_col_list[i]]
}
# display it
head(DT, 10)
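As an aside on the idea in the Background: quantile() accepts a vector of probabilities, and rollapply() returns one column per element when the applied function returns a vector, so both cut points can in principle be filled from one rolling pass per column. A minimal sketch reusing the objects defined above (no claim about how much time this actually saves):
# one rolling pass per column; quantile() computes both probabilities at once,
# and coredata() drops the zoo attributes so plain numeric columns are stored
for (i in seq_along(my_col_list)) {
  both <- coredata(rollapply(DT[[my_col_list[i]]], 7, quantile,
                             probs = c(1/3, 7/9), fill = NA, names = FALSE))
  DT[, (my_name_list1[i]) := both[, 1]]
  DT[, (my_name_list2[i]) := both[, 2]]
}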
Microbenchmarking a once-through statistic against the rolling quantile shows that the quantiles are the expensive part.
res2 <- microbenchmark(
  DT[, (my_name_list1[i]) := unlist(lapply(.SD, function(x)
           rollapply(x, 7, mean, fill = NA)),
         recursive = FALSE),
     .SDcols = my_col_list[i]],
  times = 5)
which says it takes about 1.75 milliseconds for a mean (the median takes about 1.79 ms):
> res2
Unit: milliseconds
expr
DT[, `:=`((my_name_list1[i]), unlist(lapply(.SD, function(x) rollapply(x, 7, mean, fill = NA)), recursive = F)), .SDcols = my_col_list[i]]
min lq mean median uq max neval
1.465779 1.509114 1.754145 1.618591 1.712103 2.46514 5
but it takes roughly 70x as long to compute a quantile:
res3 <- microbenchmark(
  DT[, (my_name_list1[i]) := unlist(lapply(.SD, function(x)
           rollapply(x, 7, quantile, fill = NA, probs = 1/3)),
         recursive = FALSE),
     .SDcols = my_col_list[i]],
  times = 5)
which gives
> res3
Unit: milliseconds
expr
DT[, `:=`((my_name_list1[i]), unlist(lapply(.SD, function(x) rollapply(x, 7, quantile, fill = NA, probs = 1/3)), recursive = F)), .SDcols = my_col_list[i]]
min lq mean median uq max neval
118.5833 119.2896 122.8432 124.0168 124.4183 127.9082 5
UPDATES:
Thoughts:
You are correct that the median is especially fast in this case; that is because data.table runs specialized C code for it, unlike the quantile function, which is plain R code.
We can read about this optimization in the data.table documentation under ?datatable.optimize. There we have:
When expressions in j which contains only these functions min, max, mean, median, var, sd, prod, for example, dt[, list(mean(x), median(x), min(y), max(y)), by=z], they are very effectively optimised using, what we call, GForce. These functions are replaced with gmean, gmedian, gmin, gmax instead
And they give an example of the speed improvement for the median case:
# Generate a big data.table with a relatively many columns
set.seed(1L)
dt = lapply(1:20, function(x) sample(c(-100:100), 5e6L, TRUE))
setDT(dt)[, id := sample(1e5, 5e6, TRUE)]
print(object.size(dt), units="Mb") # 400MB, not huge, but will do
# GForce
options(datatable.optimize = 2L) # optimisation 'on'
system.time(ans1 <- dt[, lapply(.SD, median), by=id])
system.time(ans2 <- dt[, lapply(.SD, function(x) as.numeric(stats::median(x))), by=id])
identical(ans1, ans2)
On my system the R internal version is about 44x slower than the data.table version.
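To check whether GForce actually kicks in for a given call, the verbose argument is handy; this is just a quick check reusing the dt and id from the documentation example above (the exact wording of the message differs between data.table versions):
# with verbose = TRUE, data.table reports how it optimised j;
# a line mentioning GForce / gmedian means the optimised C path is being used
options(datatable.optimize = 2L)
invisible(dt[, lapply(.SD, median), by = id, verbose = TRUE])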
We can still try to improve the speed of the quantile function in R itself. For this my approach is basically "Use the source, Luke" and to look at the quantile function. Looking at the source, we get the standard generic function:
>> quantile
function (x, ...)
UseMethod("quantile")
<bytecode: 0x0000000009154c78>
<environment: namespace:stats>
We can trace this a bit more:
>> methods(quantile)
[1] quantile.default* quantile.ecdf* quantile.POSIXt* quantile.zoo
see '?methods' for accessing help and source code
and have a look at the default function.
>> stats:::quantile.default
function (x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE,
type = 7, ...)
{
...
}
Now that we have the entire source, which is quite long, we can compare it to the R median source in median.default. With the source we can copy it into a user-defined function and profile it (with the small addition of supplying the namespace for format_perc). From this it is clear that only two lines are really relevant, namely the sorting and the output formatting. The sorting is very similar to what median does and is likely hard to improve; the formatting, however, can be skipped entirely by commenting it out.
fast.quant <- function (x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE,
                        type = 7, ...)
{
    if (is.factor(x)) {
    ...
    ...                     # body copied from stats:::quantile.default (elided here)
    if (names && np > 0L) {
        # names(qs) <- stats:::format_perc(probs)   # the expensive formatting, now skipped
    }
    ...
}
All in all this fix cuts the run time in half. It is still not the optimized median, but most likely it is hard to get much better performance without leaving R.
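A related shortcut, not spelled out above: quantile() already exposes this through its names argument, and rollapply() forwards extra arguments to the applied function, so passing names = FALSE skips format_perc() without redefining anything. A rough comparison (timings will of course differ by machine):
# skipping the percentage-name formatting via the existing 'names' argument;
# this avoids the format_perc() call identified above
x <- rnorm(1200)
microbenchmark(
  with_names    = rollapply(x, 7, quantile, fill = NA, probs = 1/3),
  without_names = rollapply(x, 7, quantile, fill = NA, probs = 1/3, names = FALSE),
  times = 5)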
It is possible, and maybe even probable, that the optimizations in data.table could be leveraged for the quantile calculations as well, since data.table also implements sorting in C. One would, however, still want to exploit the fact that only a partial sort is required. Otherwise the Rcpp package can be used to perform a similar optimization.
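For completeness, here is a minimal sketch (not from the original answer) of what such an Rcpp version could look like, using std::nth_element so each window is only partially sorted; the type-7 interpolation, NA padding and centre alignment are chosen to mirror the rollapply() calls above:
library(Rcpp)

cppFunction(includes = "#include <algorithm>", code = '
NumericVector roll_quantile(NumericVector x, int width, double prob) {
  int n = x.size();
  NumericVector out(n, NA_REAL);              // NA padding, like fill = NA
  std::vector<double> w(width);
  for (int i = 0; i + width <= n; ++i) {
    std::copy(x.begin() + i, x.begin() + i + width, w.begin());
    double h  = (width - 1) * prob;           // type-7 quantile position (0-based)
    int    lo = (int) h;
    std::nth_element(w.begin(), w.begin() + lo, w.end());   // partial sort only
    double q  = w[lo];
    if (lo + 1 < width && h > lo) {
      double hi = *std::min_element(w.begin() + lo + 1, w.end());
      q += (h - lo) * (hi - q);               // linear interpolation between order stats
    }
    out[i + width / 2] = q;                   // centre-align (odd width), like rollapply
  }
  return out;
}')

# should agree with rollapply(DT$V3, 7, quantile, fill = NA, probs = 1/3)
head(roll_quantile(DT$V3, 7L, 1/3), 10)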