
How to overcome memory constraints using foreach

I am trying to process more than 10,000 xts objects saved on disk, each around 0.2 GB when loaded into R. I would like to use foreach to process them in parallel. My code works for something like 100 xts objects, which I pre-load in memory, export to the workers, etc., but beyond roughly 100 objects I hit memory limits on my machine.

Example of what I am trying to do:

require(xts)
require(TTR)
require(doMPI)
require(foreach)

test.data <- runif(n=250*10*60*24)

xts.1 <- xts(test.data, order.by=as.Date(1:length(test.data), origin="1970-01-01"))
xts.1 <- cbind(xts.1, xts.1, xts.1, xts.1, xts.1, xts.1)

colnames(xts.1) <- c("Open", "High", "Low", "Close", "Volume", "Adjusted")

print(object.size(xts.1), units="Gb")

xts.2 <- xts.1
xts.3 <- xts.1
xts.4 <- xts.1

save(xts.1, file="xts.1.rda")
save(xts.2, file="xts.2.rda")
save(xts.3, file="xts.3.rda")
save(xts.4, file="xts.4.rda")

names <- c("xts.1", "xts.2", "xts.3", "xts.4")

rm(xts.1)
rm(xts.2)
rm(xts.3)
rm(xts.4)

cl <- startMPIcluster(count=2) # Use 2 cores
registerDoMPI(cl)

result <- foreach(name=names, 
                  .combine=cbind, 
                  .multicombine=TRUE, 
                  .inorder=FALSE, 
                  .packages=c("TTR")) %dopar% {
    # TODO: move the following line out of the worker. One object at a time
    # (or 5, 10, 20, ... but not all) should be loaded by the master
    # and exported to the worker "just in time"
    load(file=paste0(name, ".rda"))

    return(last(SMA(get(name)[, 1], 10)))
}

closeCluster(cl)

print(result)

So I am wondering how I could load each xts object (or several at a time, like 5, 10, 20, 100, ... but not all at once) from disk "just in time", right before it is needed by/exported to a worker. I cannot load the objects in the workers themselves (by name and folder), since the workers may be on remote machines without access to the folder where the objects are stored on disk. So I need to be able to read/load them "just in time" in the main process...

I am using doMPI and doRedis as parallel back-ends. doMPI seems more memory efficient but slower than doRedis (tested on 100 objects).

So I would like to understand what the proper "strategy"/"pattern" for approaching this problem is.

asked Oct 01 '22 by Samo


1 Answer

In addition to using doMPI or doRedis, you need to write a function that returns an appropriate iterator. There are a number of examples in my vignette "Writing Custom Iterators" from the iterators package that should be helpful, but here's a quick attempt at such a function:

library(iterators)  # provides iter() and nextElem()

ixts <- function(xtsnames) {
  it <- iter(xtsnames)

  nextEl <- function() {
    xtsname <- nextElem(it)  # throws "StopIteration" when the names run out
    # Load the object only when the next task is created, so the data is
    # read from disk "just in time" and sent to the worker by the master
    load(file=paste0(xtsname, ".rda"))
    get(xtsname)
  }

  obj <- list(nextElem=nextEl)
  class(obj) <- c('ixts', 'abstractiter', 'iter')
  obj
}

This is really simple since it's basically a wrapper around an iterator over the "names" variable. The vignette uses this technique for several of the examples.

You can use "ixts" with foreach as follows:

result <- foreach(xts=ixts(names),
                  .combine=cbind, 
                  .multicombine=TRUE, 
                  .inorder=FALSE, 
                  .packages=c("TTR")) %dopar% {
    last(SMA(xts[, 1], 10))
}

Although this iterator will work with any foreach backend, not all backends will call it just-in-time. doMPI and doRedis will, but doParallel and doMC get all of the values from the iterator up front because clusterApplyLB and mclapply require the values to all be in a list. doMPI and doRedis were designed to work with iterators in order to be more memory efficient.
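If each task should cover several objects at a time (the "5, 10, 20, ..." case from the question), the same pattern extends to a chunked iterator. The following is only a sketch along those lines and not part of the original answer: ixtsChunked and chunkSize are made-up names, and it uses idiv() from the iterators package to walk the name vector in groups and hand each task a named list of loaded objects.

# Hypothetical sketch, not from the original answer
ixtsChunked <- function(xtsnames, chunkSize=5) {
  it <- idiv(length(xtsnames), chunkSize=chunkSize)  # yields chunk sizes
  i <- 0

  nextEl <- function() {
    n <- nextElem(it)                  # throws "StopIteration" when exhausted
    nms <- xtsnames[seq(i + 1, i + n)]
    i <<- i + n
    # Load every object in this chunk and return them as a named list
    lapply(setNames(nms, nms), function(nm) {
      load(file=paste0(nm, ".rda"))
      get(nm)
    })
  }

  obj <- list(nextElem=nextEl)
  class(obj) <- c('abstractiter', 'iter')
  obj
}

Each task then receives a list of xts objects rather than a single one, so the body of the foreach loop has to loop over that list, for example:

result <- foreach(xtslist=ixtsChunked(names, chunkSize=2),
                  .combine=cbind,
                  .multicombine=TRUE,
                  .inorder=FALSE,
                  .packages=c("TTR")) %dopar% {
    do.call(cbind, lapply(xtslist, function(x) last(SMA(x[, 1], 10))))
}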

answered Oct 11 '22 by Steve Weston