Is there a way to use multiple threads for computation with data.table in R? For example, let's say I have the following data.table:
dtb <- data.table(id=rep(1:10000, 1000), x=1:1e7)
setkey(dtb, id)
f <- function(m) {
  # some really complicated function
}
res <- dtb[, f(x), by=id]
Is there a way to get R to multithread this if f takes a while to compute? And if f is quick, will multithreading help, or will most of the time be taken by data.table splitting things up into groups?
(Note: in recent versions, data.table is multi-threaded by default for many of its internal operations; getDTthreads() reports the current number of threads and setDTthreads() changes or restores it. Separately, today's multi-core machines offer parallel processing power, and distributions such as Microsoft R Open provide optional multi-threaded math libraries so that many common R operations can use all of the available processing power.)
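For instance, with a recent data.table (1.12.0 or later, where these functions exist), the built-in thread count can be inspected and changed directly; a minimal sketch:

library(data.table)
getDTthreads()    # how many threads data.table's internal parallel code will use
setDTthreads(2)   # restrict data.table to 2 threads
setDTthreads(0)   # 0 means use all logical CPUs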
I am not sure that this is "multi-threading", but perhaps you meant a multi-core solution? If so, look at this earlier answer: Performing calculations by subsets of data in R, found with a search for "[r] [data.table] parallel".
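The basic multi-core pattern from that answer is to fork one worker per chunk of groups with parallel::mclapply and bind the results back together. Here is a minimal sketch against the question's dtb; the stand-in body of f, the chunking via cut(), and the choice of 4 chunks are my own illustrative assumptions, not from the original answer:

library(data.table)
library(parallel)

f <- function(m) sum(m)  # stand-in for the really complicated function

ids <- unique(dtb$id)
chunks <- split(ids, cut(seq_along(ids), 4))  # e.g. 4 chunks for a 4-core machine

res <- rbindlist(mclapply(chunks, function(ids.chunk) {
  dtb[J(ids.chunk), f(x), by = id]  # keyed subset, then the usual grouped call
}))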
Edit: (doubling of speed on a 4-core machine, but my system monitor suggests this only used 2 cores during the mclapply call.) Code copied from this thread: http://r.789695.n4.nabble.com/Access-to-local-variables-in-quot-j-quot-expressions-tt2315330.html#a2315337
require(data.table)
require(parallel)  # provides mclapply; the original thread used the older multicore package

mk.fake.df <- function (n.groups=10000, n.per.group=70) {
  data.frame(grp=rep(1:n.groups, each=n.per.group),
             age=rep(0:(n.per.group-1), n.groups),
             x=rnorm(n.groups * n.per.group),
             ## These don't do anything, but only exist to give
             ## the table a similar size to the real data.
             y1=rnorm(n.groups * n.per.group),
             y2=rnorm(n.groups * n.per.group),
             y3=rnorm(n.groups * n.per.group),
             y4=rnorm(n.groups * n.per.group))
}

mk.fake.dt <- function (fake.df) {
  fake.dt <- as.data.table(fake.df)
  setkey(fake.dt, grp, age)
  fake.dt
}

cumsum.lag <- function (x) {  # cumulative sum of the previous elements
  x.prev <- c(0, x[-length(x)])
  cumsum(x.prev)
}

calc.fake.dt.lapply <- function (dt) {  # use base lapply for testing
  lapply(6*c(1000,1:4,6,8,10),
         function(critical.age) {
           dt$tmp <- pmax((dt$age < critical.age) * dt$x, 0)
           dt[, cumsum.lag(tmp), by = grp]$V1
         })
}

calc.fake.dt.mclapply <- function (dt) {  # the same work, forked across cores
  mclapply(6*c(1000,1:4,6,8,10),
           function(critical.age) {
             dt$tmp <- pmax((dt$age < critical.age) * dt$x, 0)
             dt[, cumsum.lag(tmp), by = grp]$V1
           })
}

df <- mk.fake.df()
dt <- mk.fake.dt(df)
system.time(res.dt.mclapply <- calc.fake.dt.mclapply(dt))
##    user  system elapsed
##   1.896   4.413   1.210
system.time(res.dt.lapply <- calc.fake.dt.lapply(dt))
##    user  system elapsed
##   1.391   0.793   2.175
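On this run the forked version cuts elapsed time from 2.175 s to 1.210 s, roughly a 1.8x speedup, consistent with the note above that only two of the four cores appeared busy; the larger user+system total for mclapply is the usual cost of forking the worker processes.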