In the changelog of data.table v1.12.0
I noticed the following:
Subsetting, ordering and grouping now use more parallelism
I tested whether I could speed up some grouping, but without any success: I ran several different tests and always got essentially identical timings. Is grouping actually parallelised? Or maybe I am not using the thread options properly? As you can see below, data.table
has been compiled with OpenMP support,
otherwise setDTthreads
would print a message telling the user that there is no OpenMP support.
Here is a reproducible example of one of my tests.
library(data.table)
n = 5e6
k = 1e4
DT = data.table(x = runif(n), y = runif(n), grp = sample(1:k, n, TRUE))
# Some function that is not too fast
f = function(x, y) as.list(eigen(cov(cbind(x, y)), only.values = TRUE)$values)
setDTthreads(1)
getDTthreads()
#> [1] 1
system.time(DT[ , f(x,y), by = grp])
#>    user  system elapsed
#>   3.365   0.008   3.374
setDTthreads(0)
getDTthreads(T)
#> omp_get_max_threads() = 4
#> omp_get_thread_limit() = 2147483647
#> DTthreads = 0
#> RestoreAfterFork = true
#> [1] 4
system.time(DT[ , f(x,y), by = grp])
#>    user  system elapsed
#>   3.324   0.029   3.238
Created on 2019-01-27 by the reprex package (v0.2.1)
Yes, grouping is parallelized in v1.12.0.
Your benchmark is a bit of a red herring: you want a fast f(x, y)
if you want to isolate the speed of grouping. Using the cardinalities of your example but with a trivial function we get:
library(data.table)
packageVersion("data.table")
#> [1] '1.12.0'
n = 5e6
N <- n
k = 1e4
print(getDTthreads())
#> [1] 12
DT = data.table(x = rep_len(runif(n), N),
                y = rep_len(runif(n), N),
                grp = rep_len(sample(1:k, n, TRUE), N))
bench::system_time(DT[, .(a = 1L), by = "grp"])
#> process real
#> 250.000ms 72.029ms
setDTthreads(1)
bench::system_time(DT[, .(a = 1L), by = "grp"])
#> process real
#> 125.000ms 126.385ms
Created on 2019-02-01 by the reprex package (v0.2.1)
That is, we were slightly faster in the parallel case, but only by about 50 ms -- negligible compared to the 3 s of your function.
If we bump up the size of the DT, we can see a more dramatic difference:
library(data.table)
packageVersion("data.table")
#> [1] '1.12.0'
n = 5e6
N <- 1e9
k = 1e4
print(getDTthreads())
#> [1] 12
DT = data.table(x = rep_len(runif(n), N),
                y = rep_len(runif(n), N),
                grp = rep_len(sample(1:k, n, TRUE), N))
bench::system_time(DT[, .(a = 1L), by = "grp"])
#> process real
#> 45.719s 14.485s
setDTthreads(1)
bench::system_time(DT[, .(a = 1L), by = "grp"])
#> process real
#> 24.859s 24.890s
sessioninfo::session_info()
#> - Session info ----------------------------------------------------------
#> setting value
#> version R version 3.5.2 (2018-12-20)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_Australia.1252
#> ctype English_Australia.1252
#> tz Australia/Sydney
#> date 2019-02-01
#>
Created on 2019-02-01 by the reprex package (v0.2.1)
Here is an answer not validated by any authority (e.g. a member of the data.table
team), based on my exploration of the issues in the data.table
GitHub repository.
From issue #3042 I understand that sum
and mean
are optimized. We can benchmark this to verify:
library(data.table)
n = 1e7 ; k = 1e5
DT = data.table(x = runif(n), y = runif(n), grp = sample(1:k, n, TRUE))
setDTthreads(1)
system.time(DT[ , mean(x), by = grp]) #> 0.8 s
setDTthreads(0)
system.time(DT[ , mean(x), by = grp]) #> 0.4 s
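One way to see that this optimization actually kicks in (a small check; the exact wording of the verbose log can differ between versions) is to turn on data.table's verbose output, which reports when j is replaced by a GForce implementation:
options(datatable.verbose = TRUE)
DT[ , mean(x), by = grp]   # the verbose log should mention GForce optimisation of j
options(datatable.verbose = FALSE)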
However Matt Dowle in the same issue #3042 wrote:
There is much left to do on extending to other gforce functions and grouping arbitrary functions
And in #3130 sritchie73 wrote:
Worth noting here that R functions are inherently not thread safe, e.g. so they can't be passed to multithreaded C++ code via Rcpp.
So it seems that parallelising user-defined functions is not a simple task, and there is currently no implementation of this in data.table.
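That said, nothing prevents you from parallelising an arbitrary f over groups yourself at the R level. Below is a minimal sketch (not a data.table feature; it assumes a Unix-like system because parallel::mclapply relies on forking, and the choice of 4 workers is arbitrary):
library(data.table)
library(parallel)

n = 5e6 ; k = 1e4
DT = data.table(x = runif(n), y = runif(n), grp = sample(1:k, n, TRUE))
f = function(x, y) as.list(eigen(cov(cbind(x, y)), only.values = TRUE)$values)

setDTthreads(1)  # avoid nesting data.table threads inside the forked workers

# Split the group ids into one chunk per worker, evaluate f() on each chunk in
# a separate forked R process, then bind the per-chunk results back together.
cores  <- 4
chunks <- split(1:k, cut(1:k, cores, labels = FALSE))
res    <- rbindlist(mclapply(chunks,
                             function(g) DT[grp %in% g, f(x, y), by = grp],
                             mc.cores = cores))
Whether this pays off depends on how expensive f is compared to the overhead of subsetting the data for each worker.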