
How to improve speed in parallel cluster processing

I'm new to cluster processing and could use some advice on how to better prepare my data and/or my calls to functions from the parallel package. I have read through the parallel package vignette, so I have a rough idea of what's going on.

The function I want to parallelize calls the 2-D interpolation tool akima::interp. My input consists of 3 matrices (or vectors -- all the same in R): one contains the x-coordinates, one the y-coordinates, and one the "z", or data values, for a set of sample points. interp uses these to produce interpolated data on a regular grid so I can, e.g., plot the field. Once I have these 3 items set up, I cut them into "chunks" and feed them to clusterApply to execute interp chunk by chunk.

I'm using a Windows 7 machine with an i7 CPU (8-core). Here's the summary output from Rprof for an input data set with 1e6 points (1000x1000 if you like), mapped onto a 1000x1000 output grid.

So my questions are: 1) It appears that "unserialize" is taking most of the time. What is this operation, and how could it be reduced? 2) In general, since each worker loads the default .Rdata file, is there any speed gained if I first save all input data to .Rdata so that it doesn't need to get passed to the workers? 3) Anything else that I'm simply unaware of that I should have done differently?

Note: the sin, atan2, cos, +, max, min functions take place prior to the clusterApply call I make.

Rgames> summaryRprof('bigprof.txt')
$by.self
                   self.time self.pct total.time total.pct
"unserialize"         329.04    99.11     329.04     99.11
"socketConnection"      1.74     0.52       1.74      0.52
"serialize"             0.96     0.29       0.96      0.29
"sin"                   0.06     0.02       0.06      0.02
"atan2"                 0.04     0.01       0.06      0.02
"cos"                   0.04     0.01       0.04      0.01
"+"                     0.02     0.01       0.02      0.01
"max"                   0.02     0.01       0.02      0.01
"min"                   0.02     0.01       0.02      0.01
"row"                   0.02     0.01       0.02      0.01
"writeLines"            0.02     0.01       0.02      0.01

$by.total
                     total.time total.pct self.time self.pct
"mcswirl"                331.98    100.00      0.00     0.00
"clusterApply"           330.00     99.40      0.00     0.00
"staticClusterApply"     330.00     99.40      0.00     0.00
"FUN"                    329.06     99.12      0.00     0.00
"unserialize"            329.04     99.11    329.04    99.11
"lapply"                 329.04     99.11      0.00     0.00
"recvData"               329.04     99.11      0.00     0.00
"recvData.SOCKnode"      329.04     99.11      0.00     0.00
"makeCluster"              1.76      0.53      0.00     0.00
"makePSOCKcluster"         1.76      0.53      0.00     0.00
"newPSOCKnode"             1.76      0.53      0.00     0.00
"socketConnection"         1.74      0.52      1.74     0.52
"serialize"                0.96      0.29      0.96     0.29
"postNode"                 0.96      0.29      0.00     0.00
"sendCall"                 0.96      0.29      0.00     0.00
"sendData"                 0.96      0.29      0.00     0.00
"sendData.SOCKnode"        0.96      0.29      0.00     0.00
"sin"                      0.06      0.02      0.06     0.02
"atan2"                    0.06      0.02      0.04     0.01
"cos"                      0.04      0.01      0.04     0.01
"+"                        0.02      0.01      0.02     0.01
"max"                      0.02      0.01      0.02     0.01
"min"                      0.02      0.01      0.02     0.01
"row"                      0.02      0.01      0.02     0.01
"writeLines"               0.02      0.01      0.02     0.01
"outer"                    0.02      0.01      0.00     0.00
"system"                   0.02      0.01      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 331.98
Carl Witthoft asked Oct 15 '13 12:10



1 Answer

When clusterApply is called, it first sends a task to each of the cluster workers, and then waits for each of them to return the corresponding result. If there are more tasks to do, it repeats that procedure until all of the tasks are complete.

The function that it uses to wait for a result from a particular worker is recvResult, which ultimately calls unserialize to read data from the socket that is connected to that worker. So if the master process is spending most of its time in unserialize, then it is spending most of its time waiting for the cluster workers to return the task results, which is what you would hope to see on the master. If it were spending a lot of time in serialize, that would mean that it was spending a lot of time sending the tasks to the workers, which would be a bad sign.

Unfortunately, you can't tell how much time unserialize spends blocking, waiting for the result data to arrive, and how much time it spends actually transferring that data. The results might be easily computed by the workers and huge, or they might take a long time to compute and be tiny: there's no way to tell from the profiling data.
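One way to get at that split outside the profiler is to have each worker time its own computation and report the size of what it sends back. The following is only a sketch under assumptions: work() is a hypothetical stand-in for the real per-chunk call (such as interp), and the chunk sizes are made up.

```r
library(parallel)

# Hypothetical stand-in for the real per-chunk work (e.g. a call to interp).
work <- function(chunk) sum(sqrt(chunk))

# Wrapper run on the worker: times the computation there and records
# the size of the object that will be sent back to the master.
timed_work <- function(chunk, fun) {
  elapsed <- system.time(value <- fun(chunk))[["elapsed"]]
  list(value = value,
       compute_sec = elapsed,
       result_bytes = as.numeric(object.size(value)))
}

cl <- makeCluster(2)
chunks <- split(as.numeric(1:100000), rep(1:4, each = 25000))

# Wall-clock time seen by the master for the whole clusterApply call.
wall_sec <- system.time(
  res <- clusterApply(cl, chunks, timed_work, fun = work)
)[["elapsed"]]
stopCluster(cl)

# Per-task compute time and result size, as measured on the workers.
compute_sec <- vapply(res, function(r) r$compute_sec, numeric(1))
result_bytes <- vapply(res, function(r) r$result_bytes, numeric(1))
```

If the sum of compute_sec divided by the number of workers is close to wall_sec, the workers are compute-bound; if it is much smaller and result_bytes is large, the time is more likely going into moving results back to the master.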

So to make unserialize execute faster, you need to make the workers compute their results faster, or make the results smaller, if that's possible. In addition, it might help to pass the useXDR=FALSE option to makeCluster. It might improve your performance by not using XDR to encode your data, making both serialize and unserialize faster.
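As a minimal sketch of that option, assuming a two-worker socket cluster on the local machine and a made-up task that returns a fairly large matrix:

```r
library(parallel)

# useXDR = FALSE is passed through makeCluster to the socket cluster setup,
# so data is exchanged in native byte order instead of XDR encoding.
cl <- makeCluster(2, useXDR = FALSE)

# Hypothetical task: each worker returns a 500 x 500 numeric matrix.
big <- clusterApply(cl, 1:2, function(i) matrix(runif(250000), 500, 500))

stopCluster(cl)
```

Whether this makes a measurable difference depends on how much of the time is really spent moving data, so it is worth timing both settings on the real workload.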

I don't think it will help to save all input data to .Rdata since you're not spending much time sending data to the workers, as seen by the short time spent in the serialize function. I suspect that would slow you down a little bit.

The only other advice I can think of is to try using parLapply or clusterApplyLB, rather than clusterApply. I recommend using parLapply unless you have a specific reason to use one of the other functions since parLapply is often the most efficient. clusterApplyLB is useful when you have tasks that take a long but variable length of time to execute.
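The three functions take the same basic arguments, so switching between them is cheap to try. A small sketch, with a hypothetical function f and made-up chunks standing in for the real interpolation work:

```r
library(parallel)

cl <- makeCluster(2)

# Hypothetical per-chunk function and input chunks.
f <- function(chunk) sum(chunk^2)
chunks <- split(as.numeric(1:1000), rep(1:10, each = 100))

# One task per chunk, handed out in fixed round-robin order.
r_apply <- clusterApply(cl, chunks, f)

# Chunks are grouped into one batch per worker, so there are fewer round trips.
r_par <- parLapply(cl, chunks, f)

# One task per chunk, each given to whichever worker is free next.
r_lb <- clusterApplyLB(cl, chunks, f)

stopCluster(cl)

# All three return the results in input order.
same_results <- identical(unlist(r_apply, use.names = FALSE),
                          unlist(r_par, use.names = FALSE)) &&
  identical(unlist(r_apply, use.names = FALSE),
            unlist(r_lb, use.names = FALSE))
```

With only a handful of equally sized chunks the difference is likely to be small; it tends to matter more when there are many tasks or when task durations vary widely.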

Steve Weston answered Sep 22 '22 17:09
