In the changelog of <code>data.table v1.12.0</code> I noticed the following: <blockquote> Subsetting, ordering and grouping now use more parallelism </blockquote> I tested if I could speed-up some grouping but without any success. I made several different tests and I always get identical results. Is grouping actually parallelised? Maybe I do not properly use the thread options? As you can see <code>data.table</code> has been compiled with <code>openmp</code> otherwise <code>setDTthread</code> print a message to tell the user that there is no support of <code>openmp</code>. Here a reproducile example of one of my tests. <pre class="prettyprint"><code>library(data.table) n = 5e6 k = 1e4 DT = data.table(x = runif(n), y = runif(n), grp = sample(1:k, n, TRUE)) # Any function not too fast f = function(x,y) as.list(eigen(cov(cbind(x,y)), only.values = TRUE)$value) setDTthreads(1) getDTthreads() #> [1] 1 system.time(DT[ , f(x,y), by = grp]) #> utilisateur système écoulé #> 3.365 0.008 3.374 setDTthreads(0) getDTthreads(T) #> omp_get_max_threads() = 4 #> omp_get_thread_limit() = 2147483647 #> DTthreads = 0 #> RestoreAfterFork = true #> [1] 4 system.time(DT[ , f(x,y), by = grp]) #> utilisateur système écoulé #> 3.324 0.029 3.238 </code></pre> Created on 2019-01-27 by the reprex package (v0.2.1)

Yes, grouping is parallelized in v 1.12.0 Your benchmark is a bit of a red herring. You want a fast <code>f(x, y)</code> if you want to isolate the speed of grouping. Using the cardinalities of your examples but with a trivial function we get: <pre class="prettyprint lang-r prettyprint-override"><code>library(data.table) packageVersion("data.table") #> [1] '1.12.0' n = 5e6 N <- n k = 1e4 print(getDTthreads()) #> [1] 12 DT = data.table(x = rep_len(runif(n), N), y = rep_len(runif(n), N), grp = rep_len(sample(1:k, n, TRUE), N)) bench::system_time(DT[, .(a = 1L), by = "grp"]) #> process real #> 250.000ms 72.029ms setDTthreads(1) bench::system_time(DT[, .(a = 1L), by = "grp"]) #> process real #> 125.000ms 126.385ms </code></pre> Created on 2019-02-01 by the reprex package (v0.2.1) That is, we were slightly faster in the parallel case, but only by about 50 ms -- negligible compared to the 3 s of your function. If we bump out the size of the DT, we can see a more dramatic difference: <pre class="prettyprint lang-r prettyprint-override"><code>library(data.table) packageVersion("data.table") #> [1] '1.12.0' n = 5e6 N <- 1e9 k = 1e4 print(getDTthreads()) #> [1] 12 DT = data.table(x = rep_len(runif(n), N), y = rep_len(runif(n), N), grp = rep_len(sample(1:k, n, TRUE), N)) bench::system_time(DT[, .(a = 1L), by = "grp"]) #> process real #> 45.719s 14.485s setDTthreads(1) bench::system_time(DT[, .(a = 1L), by = "grp"]) #> process real #> 24.859s 24.890s </code></pre> <pre class="prettyprint lang-r prettyprint-override"><code>sessioninfo::session_info() #> - Session info ---------------------------------------------------------- #> setting value #> version R version 3.5.2 (2018-12-20) #> os Windows 10 x64 #> system x86_64, mingw32 #> ui RTerm #> language (EN) #> collate English_Australia.1252 #> ctype English_Australia.1252 #> tz Australia/Sydney #> date 2019-02-01 #> </code></pre> Created on 2019-02-01 by the reprex package (v0.2.1)

Here an answer not validated by any authority (e.g. a member of <code>data.table</code> team) based on my exploration of the <code>data.table</code> issues from the github repository. From issue #3042 I understand that <code>sum</code> and <code>mean</code> are optimized. We can benchmark it to verify that it is correct: <pre class="prettyprint"><code>library(data.table) n = 1e7 ; k = 1e5 DT = data.table(x = runif(n), y = runif(n), grp = sample(1:k, n, TRUE)) setDTthreads(1) system.time(DT[ , mean(x), by = grp]) #> 0.8 s setDTthreads(0) system.time(DT[ , mean(x), by = grp]) #> 0.4 s </code></pre> However Matt Dowle in the same issue #3042 wrote: <blockquote> There is much left to do on extending to other gforce functions and grouping arbitrary functions </blockquote> And in #3130 sritchie73 wrote <blockquote> Worth noting here that R functions are inherently not thread safe, e.g. so they can't be passed to multithreaded C++ code via Rcpp. </blockquote> So, it seems that parallelization of user-defined functions is not a simple task and there is no current implementation in <code>data.table</code>.

Is grouping parallelised in data.table 1.12.0?

Tags:

r

data.table

openmp

In the changelog of data.table v1.12.0 I noticed the following:

Subsetting, ordering and grouping now use more parallelism

I tested if I could speed-up some grouping but without any success. I made several different tests and I always get identical results. Is grouping actually parallelised? Maybe I do not properly use the thread options? As you can see data.table has been compiled with openmp otherwise setDTthread print a message to tell the user that there is no support of openmp. Here a reproducile example of one of my tests.

library(data.table)

n = 5e6
k = 1e4

DT = data.table(x = runif(n), y = runif(n), grp = sample(1:k, n, TRUE))

# Any function not too fast
f = function(x,y) as.list(eigen(cov(cbind(x,y)), only.values = TRUE)$value)

setDTthreads(1)
getDTthreads()
#> [1] 1

system.time(DT[ , f(x,y), by = grp])
#> utilisateur     système      écoulé 
#>       3.365       0.008       3.374

setDTthreads(0)
getDTthreads(T)
#> omp_get_max_threads() = 4
#> omp_get_thread_limit() = 2147483647
#> DTthreads = 0
#> RestoreAfterFork = true
#> [1] 4

system.time(DT[ , f(x,y), by = grp])
#> utilisateur     système      écoulé 
#>       3.324       0.029       3.238

^{Created on 2019-01-27 by the reprex package (v0.2.1)}

714

asked Jan 28 '19 02:01

JRR

2 Answers

Yes, grouping is parallelized in v 1.12.0

Your benchmark is a bit of a red herring. You want a fast f(x, y) if you want to isolate the speed of grouping. Using the cardinalities of your examples but with a trivial function we get:

library(data.table)
  packageVersion("data.table")
#> [1] '1.12.0'

n = 5e6
N <- n
k = 1e4

print(getDTthreads())
#> [1] 12

DT = data.table(x = rep_len(runif(n), N),
                y = rep_len(runif(n), N),
                grp = rep_len(sample(1:k, n, TRUE), N))
bench::system_time(DT[, .(a = 1L), by = "grp"])
#>   process      real 
#> 250.000ms  72.029ms

setDTthreads(1)

bench::system_time(DT[, .(a = 1L), by = "grp"])
#>   process      real 
#> 125.000ms 126.385ms

^{Created on 2019-02-01 by the reprex package (v0.2.1)}

That is, we were slightly faster in the parallel case, but only by about 50 ms -- negligible compared to the 3 s of your function.

If we bump out the size of the DT, we can see a more dramatic difference:

library(data.table)
  packageVersion("data.table")
#> [1] '1.12.0'

n = 5e6
N <- 1e9
k = 1e4

print(getDTthreads())
#> [1] 12

DT = data.table(x = rep_len(runif(n), N),
                y = rep_len(runif(n), N),
                grp = rep_len(sample(1:k, n, TRUE), N))
bench::system_time(DT[, .(a = 1L), by = "grp"])
#> process    real 
#> 45.719s 14.485s

setDTthreads(1)

bench::system_time(DT[, .(a = 1L), by = "grp"])
#> process    real 
#> 24.859s 24.890s

sessioninfo::session_info()
#> - Session info ----------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.5.2 (2018-12-20)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_Australia.1252      
#>  ctype    English_Australia.1252      
#>  tz       Australia/Sydney            
#>  date     2019-02-01                  
#>

^{Created on 2019-02-01 by the reprex package (v0.2.1)}

150

answered Oct 12 '22 11:10

Hugh

Here an answer not validated by any authority (e.g. a member of data.table team) based on my exploration of the data.table issues from the github repository.

From issue #3042 I understand that sum and mean are optimized. We can benchmark it to verify that it is correct:

library(data.table)
n = 1e7 ; k = 1e5
DT = data.table(x = runif(n), y = runif(n), grp = sample(1:k, n, TRUE))

setDTthreads(1)
system.time(DT[ , mean(x), by = grp]) #> 0.8 s
setDTthreads(0)
system.time(DT[ , mean(x), by = grp]) #> 0.4 s

However Matt Dowle in the same issue #3042 wrote:

There is much left to do on extending to other gforce functions and grouping arbitrary functions

And in #3130 sritchie73 wrote

Worth noting here that R functions are inherently not thread safe, e.g. so they can't be passed to multithreaded C++ code via Rcpp.

So, it seems that parallelization of user-defined functions is not a simple task and there is no current implementation in data.table.

answered Oct 12 '22 11:10

JRR

Related questions
                            
                                segfault in R using reshape2 package and dcast
                            
                                How to handle C++ internal data structure in R in order to allow save/load?
                            
                                knitr plots, labels, and captions within one chunk
                            
                                navlistPanel: Make tabs sequentially active in shiny app
                            
                                Connect to Redshift via SSL using R
                            
                                Export data into specific cells in Excel sheet
                            
                                Clustering on a Map using R and Leaflet
                            
                                Extend dimensions in netCDF file using R
                            
                                Tabulating characters with diacritics in R
                            
                                Use ... to modify a nested list within a functional
                            
                                Capitalize first letter after special characters
                            
                                How to plot parallel coordinates with multiple categorical variables in R
                            
                                How to keep join column unchanged in data.table non-equi join?
                            
                                "Could not find function" in Roxygen examples during CMD check
                            
                                reg.finalizer() in an R package does not execute at the end of an R session
                            
                                Read large csv file from S3 into R
                            
                                Adding a polygon to a scatter plotly while retaining the hover info
                            
                                Strange behaviour for data.frames without column names
                            
                                readLines function with new version of R
                            
                                `ggplot2` axis.text margin with modified scale position

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With