I am using the <code>parallel</code> library in R to process a large data set on which I am applying complex operations. For the sake of providing a reproducible code, you can find below a simpler example: <pre class="prettyprint"><code>#data generation dir <- "C:/Users/things_to_process/" setwd(dir) for(i in 1:800) { my.matrix <- matrix(runif(100),ncol=10,nrow=10) saveRDS(my.matrix,file=paste0(dir,"/matrix",i)) } #worker function worker.function <- function(files) { files.length <- length(files) partial.results <- vector('list',files.length) for(i in 1:files.length) { matrix <- readRDS(files[i]) partial.results[[i]] <- sum(diag(matrix)) } Reduce('+',partial.results) } #master part cl <- makeCluster(detectCores(), type = "PSOCK") file_list <- list.files(path=dir,recursive=FALSE,full.names=TRUE) part <- clusterSplit(cl,seq_along(file_list)) files.partitioned <- lapply(part,function(p) file_list[p]) results <- clusterApply(cl,files.partitioned,worker.function) result <- Reduce('+',results) </code></pre> Essentially, I am wondering if trying to read files in parallel would be done in an interleaved fashion instead. And if, as a result, this bottleneck would cut down on the expected performance of running tasks in parallel? Would it be better if I first read all matrices at once in a list then sent chunks of this list to each core for it to be processed? what if these matrices were much larger, would I be able to load all of them in a list at once ?

Instead of saving each <code>matrix</code> in a separate RDS file, have you tried saving a <code>list</code> of N matrices in each file, where N is the number that is going to be processed by a single worker? Then the <code>worker.function</code> looks like: <pre class="prettyprint"><code>worker.function <- function(file) { matrix_list <- readRDS(file) partial_results <- lapply(matrix_list, function(mat) sum(diag(mat))) Reduce('+',partial.results) } </code></pre> You should save some time on I/O and maybe even on computation by replacing a <code>for</code> with a <code>lapply</code>.

reading and processing files in parallel in R

I am using the parallel library in R to process a large data set on which I am applying complex operations.

For the sake of providing a reproducible code, you can find below a simpler example:

#data generation
dir <- "C:/Users/things_to_process/"

setwd(dir)
for(i in 1:800)
{
    my.matrix <- matrix(runif(100),ncol=10,nrow=10)

    saveRDS(my.matrix,file=paste0(dir,"/matrix",i))
}

#worker function
worker.function <- function(files)
{
    files.length <- length(files)
    partial.results <- vector('list',files.length)

    for(i in 1:files.length)
    {
        matrix <- readRDS(files[i])
        partial.results[[i]] <- sum(diag(matrix))
    }

    Reduce('+',partial.results) 
}


#master part
cl <- makeCluster(detectCores(), type = "PSOCK")

file_list <- list.files(path=dir,recursive=FALSE,full.names=TRUE)

part <- clusterSplit(cl,seq_along(file_list))
files.partitioned <- lapply(part,function(p) file_list[p])

results <- clusterApply(cl,files.partitioned,worker.function)

result <- Reduce('+',results)

Essentially, I am wondering if trying to read files in parallel would be done in an interleaved fashion instead. And if, as a result, this bottleneck would cut down on the expected performance of running tasks in parallel?

Would it be better if I first read all matrices at once in a list then sent chunks of this list to each core for it to be processed? what if these matrices were much larger, would I be able to load all of them in a list at once ?

What is parallel processing in R?

Parallel processing (in the extreme) means that all the f# processes start simultaneously and run to completion on their own. If we have a single computer at our disposal and have to run n models, each taking s seconds, the total running time will be n*s .

Does R support parallel computing?

There are various packages in R which allow parallelization. “parallel” Package The parallel package in R can perform tasks in parallel by providing the ability to allocate cores to R. The working involves finding the number of cores in the system and allocating all of them or a subset to make a cluster.

What makes a good R code runs in parallel?

Running R code in parallel can be very useful in speeding up performance. Basically, parallelization allows you to run multiple processes in your code simultaneously, rather than than iterating over a list one element at a time, or running a single process at a time.

Instead of saving each matrix in a separate RDS file, have you tried saving a list of N matrices in each file, where N is the number that is going to be processed by a single worker?

Then the worker.function looks like:

worker.function <- function(file) {
    matrix_list <- readRDS(file)
    partial_results <- lapply(matrix_list, function(mat) sum(diag(mat)))
    Reduce('+',partial.results)
}

You should save some time on I/O and maybe even on computation by replacing a for with a lapply.

reading and processing files in parallel in R

Tags:

file-io

r

parallel-processing

large-files

Imlerith

People also ask

1 Answers

Philippe Marchand

Recent Activity

Donate For Us

reading and processing files in parallel in R

Tags:

file-io

r

parallel-processing

large-files

Imlerith

People also ask

1 Answers

Philippe Marchand

Related questions

Recent Activity

Donate For Us