reading and processing files in parallel in R

I am using the parallel package in R to process a large data set to which I am applying complex operations.

For the sake of reproducibility, here is a simpler example:

#data generation: write 800 small matrices to disk as RDS files
dir <- "C:/Users/things_to_process/"

setwd(dir)
for(i in 1:800)
{
    my.matrix <- matrix(runif(100),ncol=10,nrow=10)

    saveRDS(my.matrix,file=paste0(dir,"matrix",i))
}

#worker function: each worker reads its share of files and sums the matrix traces
worker.function <- function(files)
{
    files.length <- length(files)
    partial.results <- vector('list',files.length)

    for(i in seq_len(files.length))
    {
        mat <- readRDS(files[i])   #'mat' rather than 'matrix', to avoid masking base::matrix
        partial.results[[i]] <- sum(diag(mat))
    }

    Reduce('+',partial.results)
}


#master part
library(parallel)

cl <- makeCluster(detectCores(), type = "PSOCK")

file_list <- list.files(path=dir,recursive=FALSE,full.names=TRUE)

#split the file indices into one chunk per worker
part <- clusterSplit(cl,seq_along(file_list))
files.partitioned <- lapply(part,function(p) file_list[p])

results <- clusterApply(cl,files.partitioned,worker.function)

result <- Reduce('+',results)

stopCluster(cl)

Essentially, I am wondering whether reads issued by several workers at once end up interleaved on the same disk, and whether, as a result, this I/O bottleneck would cut into the expected performance gain from running the tasks in parallel.
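One way I thought of to check this (a rough sketch; the subset size n.check is an arbitrary number I picked for illustration) is to time the same reads sequentially and through the cluster:

#rough benchmark sketch: sequential vs parallel reads of the same files
library(parallel)

n.check <- 200
subset.files <- file_list[1:n.check]

seq.time <- system.time(lapply(subset.files, readRDS))

cl <- makeCluster(detectCores(), type = "PSOCK")
par.time <- system.time(clusterApply(cl, subset.files, readRDS))
stopCluster(cl)

seq.time
par.time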

Would it be better if I first read all the matrices into a list at once and then sent chunks of this list to each core for processing? And if these matrices were much larger, would I even be able to load all of them into a list at once?
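For reference, the read-everything-first alternative I have in mind would look roughly like this (a sketch; it opens its own cluster and reuses file_list from above):

#read the whole data set in the master, then ship chunks of matrices
all.matrices <- lapply(file_list, readRDS)

cl <- makeCluster(detectCores(), type = "PSOCK")
part <- clusterSplit(cl, seq_along(all.matrices))
chunks <- lapply(part, function(p) all.matrices[p])

#workers receive the matrices directly and do no file I/O themselves
results <- clusterApply(cl, chunks, function(chunk) {
    Reduce('+', lapply(chunk, function(m) sum(diag(m))))
})
stopCluster(cl)

result <- Reduce('+', results)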

Imlerith asked Aug 06 '16


1 Answer

Instead of saving each matrix in a separate RDS file, have you tried saving a list of N matrices in each file, where N is the number of matrices to be processed by a single worker?
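For example, the repacking step could look like this (a sketch; n.workers and the bundle file names are placeholders of mine, not from the question). In a real run you would write the bundles directly at data-generation time instead of re-reading the single-matrix files:

library(parallel)

#bundle the individual matrix files into one RDS file per worker
n.workers <- detectCores()
groups <- split(file_list, rep_len(seq_len(n.workers), length(file_list)))

for (g in seq_along(groups)) {
    matrix_list <- lapply(groups[[g]], readRDS)
    saveRDS(matrix_list, file = paste0(dir, "bundle", g, ".rds"))
}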

Then the worker.function looks like:

worker.function <- function(file) {
    #each file now contains a whole list of matrices
    matrix_list <- readRDS(file)
    partial_results <- lapply(matrix_list, function(mat) sum(diag(mat)))
    Reduce('+', partial_results)
}

You should save some time on I/O, and maybe even some on computation, by replacing the for loop with lapply.
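On the master side the call would then be along these lines (again a sketch; the "bundle" pattern matches the hypothetical file names above):

#hand each worker exactly one bundle file
bundle_files <- list.files(path = dir, pattern = "^bundle", full.names = TRUE)

cl <- makeCluster(detectCores(), type = "PSOCK")
results <- clusterApply(cl, bundle_files, worker.function)
stopCluster(cl)

result <- Reduce('+', results)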

Philippe Marchand answered Oct 08 '22