I am using the parallel
library in R to process a large data set on which I am applying complex operations.
For the sake of providing a reproducible code, you can find below a simpler example:
#data generation
dir <- "C:/Users/things_to_process/"
setwd(dir)
for(i in 1:800)
{
my.matrix <- matrix(runif(100),ncol=10,nrow=10)
saveRDS(my.matrix,file=paste0(dir,"/matrix",i))
}
#worker function
worker.function <- function(files)
{
files.length <- length(files)
partial.results <- vector('list',files.length)
for(i in 1:files.length)
{
matrix <- readRDS(files[i])
partial.results[[i]] <- sum(diag(matrix))
}
Reduce('+',partial.results)
}
#master part
cl <- makeCluster(detectCores(), type = "PSOCK")
file_list <- list.files(path=dir,recursive=FALSE,full.names=TRUE)
part <- clusterSplit(cl,seq_along(file_list))
files.partitioned <- lapply(part,function(p) file_list[p])
results <- clusterApply(cl,files.partitioned,worker.function)
result <- Reduce('+',results)
Essentially, I am wondering if trying to read files in parallel would be done in an interleaved fashion instead. And if, as a result, this bottleneck would cut down on the expected performance of running tasks in parallel?
Would it be better if I first read all matrices at once in a list then sent chunks of this list to each core for it to be processed? what if these matrices were much larger, would I be able to load all of them in a list at once ?
Parallel processing (in the extreme) means that all the f# processes start simultaneously and run to completion on their own. If we have a single computer at our disposal and have to run n models, each taking s seconds, the total running time will be n*s .
There are various packages in R which allow parallelization. “parallel” Package The parallel package in R can perform tasks in parallel by providing the ability to allocate cores to R. The working involves finding the number of cores in the system and allocating all of them or a subset to make a cluster.
Running R code in parallel can be very useful in speeding up performance. Basically, parallelization allows you to run multiple processes in your code simultaneously, rather than than iterating over a list one element at a time, or running a single process at a time.
Instead of saving each matrix
in a separate RDS file, have you tried saving a list
of N matrices in each file, where N is the number that is going to be processed by a single worker?
Then the worker.function
looks like:
worker.function <- function(file) {
matrix_list <- readRDS(file)
partial_results <- lapply(matrix_list, function(mat) sum(diag(mat)))
Reduce('+',partial.results)
}
You should save some time on I/O and maybe even on computation by replacing a for
with a lapply
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With