Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is grouping parallelised in data.table 1.12.0?

In the changelog of data.table v1.12.0 I noticed the following:

Subsetting, ordering and grouping now use more parallelism

I tested if I could speed-up some grouping but without any success. I made several different tests and I always get identical results. Is grouping actually parallelised? Maybe I do not properly use the thread options? As you can see data.table has been compiled with openmp otherwise setDTthread print a message to tell the user that there is no support of openmp. Here a reproducile example of one of my tests.

library(data.table)

n = 5e6
k = 1e4

DT = data.table(x = runif(n), y = runif(n), grp = sample(1:k, n, TRUE))

# Any function not too fast
f = function(x,y) as.list(eigen(cov(cbind(x,y)), only.values = TRUE)$value)

setDTthreads(1)
getDTthreads()
#> [1] 1

system.time(DT[ , f(x,y), by = grp])
#> utilisateur     système      écoulé 
#>       3.365       0.008       3.374

setDTthreads(0)
getDTthreads(T)
#> omp_get_max_threads() = 4
#> omp_get_thread_limit() = 2147483647
#> DTthreads = 0
#> RestoreAfterFork = true
#> [1] 4

system.time(DT[ , f(x,y), by = grp])
#> utilisateur     système      écoulé 
#>       3.324       0.029       3.238

Created on 2019-01-27 by the reprex package (v0.2.1)

like image 714
JRR Avatar asked Jan 28 '19 02:01

JRR


People also ask

What package is Data Table in in r?

Data. table is an extension of data. frame package in R. It is widely used for fast aggregation of large datasets, low latency add/update/remove of columns, quicker ordered joins, and a fast file reader.


2 Answers

Yes, grouping is parallelized in v 1.12.0

Your benchmark is a bit of a red herring. You want a fast f(x, y) if you want to isolate the speed of grouping. Using the cardinalities of your examples but with a trivial function we get:

library(data.table)
  packageVersion("data.table")
#> [1] '1.12.0'

n = 5e6
N <- n
k = 1e4

print(getDTthreads())
#> [1] 12

DT = data.table(x = rep_len(runif(n), N),
                y = rep_len(runif(n), N),
                grp = rep_len(sample(1:k, n, TRUE), N))
bench::system_time(DT[, .(a = 1L), by = "grp"])
#>   process      real 
#> 250.000ms  72.029ms

setDTthreads(1)

bench::system_time(DT[, .(a = 1L), by = "grp"])
#>   process      real 
#> 125.000ms 126.385ms

Created on 2019-02-01 by the reprex package (v0.2.1)

That is, we were slightly faster in the parallel case, but only by about 50 ms -- negligible compared to the 3 s of your function.

If we bump out the size of the DT, we can see a more dramatic difference:

library(data.table)
  packageVersion("data.table")
#> [1] '1.12.0'

n = 5e6
N <- 1e9
k = 1e4

print(getDTthreads())
#> [1] 12

DT = data.table(x = rep_len(runif(n), N),
                y = rep_len(runif(n), N),
                grp = rep_len(sample(1:k, n, TRUE), N))
bench::system_time(DT[, .(a = 1L), by = "grp"])
#> process    real 
#> 45.719s 14.485s

setDTthreads(1)

bench::system_time(DT[, .(a = 1L), by = "grp"])
#> process    real 
#> 24.859s 24.890s
sessioninfo::session_info()
#> - Session info ----------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.5.2 (2018-12-20)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_Australia.1252      
#>  ctype    English_Australia.1252      
#>  tz       Australia/Sydney            
#>  date     2019-02-01                  
#> 

Created on 2019-02-01 by the reprex package (v0.2.1)

like image 150
Hugh Avatar answered Oct 12 '22 11:10

Hugh


Here an answer not validated by any authority (e.g. a member of data.table team) based on my exploration of the data.table issues from the github repository.

From issue #3042 I understand that sum and mean are optimized. We can benchmark it to verify that it is correct:

library(data.table)
n = 1e7 ; k = 1e5
DT = data.table(x = runif(n), y = runif(n), grp = sample(1:k, n, TRUE))

setDTthreads(1)
system.time(DT[ , mean(x), by = grp]) #> 0.8 s
setDTthreads(0)
system.time(DT[ , mean(x), by = grp]) #> 0.4 s

However Matt Dowle in the same issue #3042 wrote:

There is much left to do on extending to other gforce functions and grouping arbitrary functions

And in #3130 sritchie73 wrote

Worth noting here that R functions are inherently not thread safe, e.g. so they can't be passed to multithreaded C++ code via Rcpp.

So, it seems that parallelization of user-defined functions is not a simple task and there is no current implementation in data.table.

like image 28
JRR Avatar answered Oct 12 '22 11:10

JRR