I'm new to cluster processing and could use some advice on how to better prepare my data and/or my calls to functions from the parallel package. I have read through the parallel package vignettes, so I have a vague idea of what's going on.
The function I want to parallelize calls the 2-D interpolation tool akima::interp. My input consists of 3 matrices (or vectors -- all the same in R): one contains the x-coordinates, one the y-coordinates, and one the "z", or data values, for a set of sample points. interp uses these to produce interpolated data on a regular grid so I can, e.g., plot the field. Once I have these 3 items set up, I cut them into "chunks" and feed them to clusterApply to execute interp chunk by chunk.
I'm using a Windows 7, i7 CPU (8-core) machine. The summary output from Rprof at the end of this question is for an input data set with 1e6 points (1000x1000 if you like), mapped onto a 1000x1000 output grid.
So my questions are:
1) It appears that "unserialize" is taking most of the time. What is this operation, and how could it be reduced?
2) In general, since each worker loads the default .Rdata file, is there any speed gained if I first save all input data to .Rdata so that it doesn't need to get passed to the workers?
3) Anything else that I'm simply unaware of that I should have done differently?
Note: the sin, atan2, cos, +, max, and min calls take place before the clusterApply call I make.
Rgames> summaryRprof('bigprof.txt')
$by.self
                      self.time self.pct total.time total.pct
"unserialize"            329.04    99.11     329.04     99.11
"socketConnection"         1.74     0.52       1.74      0.52
"serialize"                0.96     0.29       0.96      0.29
"sin"                      0.06     0.02       0.06      0.02
"atan2"                    0.04     0.01       0.06      0.02
"cos"                      0.04     0.01       0.04      0.01
"+"                        0.02     0.01       0.02      0.01
"max"                      0.02     0.01       0.02      0.01
"min"                      0.02     0.01       0.02      0.01
"row"                      0.02     0.01       0.02      0.01
"writeLines"               0.02     0.01       0.02      0.01

$by.total
                      total.time total.pct self.time self.pct
"mcswirl"                 331.98    100.00      0.00     0.00
"clusterApply"            330.00     99.40      0.00     0.00
"staticClusterApply"      330.00     99.40      0.00     0.00
"FUN"                     329.06     99.12      0.00     0.00
"unserialize"             329.04     99.11    329.04    99.11
"lapply"                  329.04     99.11      0.00     0.00
"recvData"                329.04     99.11      0.00     0.00
"recvData.SOCKnode"       329.04     99.11      0.00     0.00
"makeCluster"               1.76      0.53      0.00     0.00
"makePSOCKcluster"          1.76      0.53      0.00     0.00
"newPSOCKnode"              1.76      0.53      0.00     0.00
"socketConnection"          1.74      0.52      1.74     0.52
"serialize"                 0.96      0.29      0.96     0.29
"postNode"                  0.96      0.29      0.00     0.00
"sendCall"                  0.96      0.29      0.00     0.00
"sendData"                  0.96      0.29      0.00     0.00
"sendData.SOCKnode"         0.96      0.29      0.00     0.00
"sin"                       0.06      0.02      0.06     0.02
"atan2"                     0.06      0.02      0.04     0.01
"cos"                       0.04      0.01      0.04     0.01
"+"                         0.02      0.01      0.02     0.01
"max"                       0.02      0.01      0.02     0.01
"min"                       0.02      0.01      0.02     0.01
"row"                       0.02      0.01      0.02     0.01
"writeLines"                0.02      0.01      0.02     0.01
"outer"                     0.02      0.01      0.00     0.00
"system"                    0.02      0.01      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 331.98
When clusterApply is called, it first sends a task to each of the cluster workers, and then waits for each of them to return the corresponding result. If there are more tasks to do, it repeats that procedure until all of the tasks are complete.
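The round-by-round dispatch described above can be observed with a small sketch. It assumes two local socket workers and five trivial tasks (both numbers are arbitrary choices for illustration), and has each task report the process ID of the worker that ran it:

```r
library(parallel)

cl <- makeCluster(2)   # two local socket workers (assumed count)
tasks <- 1:5           # five tasks, so three rounds on two workers

# Each task returns the process ID of the worker that executed it.
pids <- unlist(clusterApply(cl, tasks, function(i) Sys.getpid()))
stopCluster(cl)

# Relabel the process IDs as worker 1, worker 2, ... in order of first use.
worker_of_task <- match(pids, unique(pids))
print(worker_of_task)
```

With this scheduling, consecutive tasks should alternate between the two workers, and no task in the next round starts until every result from the current round has been received.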
The function that it uses to wait for a result from a particular worker is recvResult, which ultimately calls unserialize to read data from the socket that is connected to that worker. So if the master process is spending most of its time in unserialize, then it is spending most of its time waiting for the cluster workers to return the task results, which is what you would hope to see on the master. If it were spending a lot of time in serialize, that would mean it was spending a lot of time sending the tasks to the workers, which would be a bad sign.
Unfortunately, you can't tell how much time unserialize spends blocking, waiting for the result data to arrive, and how much time it spends actually transferring that data. The results might be easily computed by the workers and huge, or they might take a long time to compute and be tiny: there's no way to tell from the profiling data.
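One way to get at that split outside the profiler is to time the work on the worker itself and to measure how large each serialized result is. The sketch below is only an illustration: `work` is a hypothetical stand-in for the real per-chunk function (for example, a wrapper around akima::interp), and `timed` is a helper introduced here, not part of the parallel package.

```r
library(parallel)

# Hypothetical stand-in for the real per-chunk computation.
work <- function(n) {
  m <- matrix(runif(n * n), n, n)
  m %*% m
}

# Helper (an assumption of this sketch): run f on the worker, time it there,
# and return the elapsed time together with the result.
timed <- function(n, f) {
  elapsed <- system.time(value <- f(n))[["elapsed"]]
  list(value = value, worker_seconds = elapsed)
}

cl <- makeCluster(2)
res <- clusterApply(cl, c(200, 200), timed, f = work)
stopCluster(cl)

# Compute time as seen by each worker, and the size of what it sent back.
worker_seconds <- sapply(res, function(r) r$worker_seconds)
result_bytes <- sapply(res, function(r) length(serialize(r$value, NULL)))
print(data.frame(worker_seconds, result_bytes))
```

If worker_seconds accounts for nearly all of the wall-clock time, the master is mostly waiting on computation; if it is small while result_bytes is large, the transfer is the more likely cost.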
So to make unserialize execute faster, you need to make the workers compute their results faster, or make the results smaller, if that's possible. In addition, it might help to pass the useXDR=FALSE option to makeCluster. It might improve your performance by not using XDR to encode your data, making both serialize and unserialize faster.
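As a rough check of whether the encoding matters on a given machine, the two formats can be compared locally with serialize before changing the cluster setup. This is a sketch only: the matrix size is chosen to match the 1000x1000 grid from the question, the worker count is an assumption, and timings will vary by machine.

```r
library(parallel)

x <- matrix(runif(1e6), 1000, 1000)   # same size as the 1000x1000 output grid

# Local comparison of the two encodings; no cluster is needed for this part.
t_xdr    <- system.time(b_xdr    <- serialize(x, NULL, xdr = TRUE))[["elapsed"]]
t_native <- system.time(b_native <- serialize(x, NULL, xdr = FALSE))[["elapsed"]]
cat("XDR:", t_xdr, "s; native byte order:", t_native, "s\n")

# Both encodings round-trip to the same object.
stopifnot(identical(unserialize(b_xdr), unserialize(b_native)))

# Requesting native byte order for a socket cluster (two workers assumed).
cl <- makeCluster(2, useXDR = FALSE)
res <- clusterApply(cl, 1:2, function(i) matrix(i, 500, 500))
stopCluster(cl)
```

Native byte order is only appropriate when the master and all workers use the same byte order, which is the case for workers on the same machine.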
I don't think it will help to save all input data to .Rdata, since you're not spending much time sending data to the workers, as shown by the short time spent in the serialize function. I suspect that it would slow you down a little bit.
The only other advice I can think of is to try using parLapply or clusterApplyLB, rather than clusterApply. I recommend using parLapply unless you have a specific reason to use one of the other functions, since parLapply is often the most efficient. clusterApplyLB is useful when you have tasks that take a long but variable length of time to execute.
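Both alternatives are close to drop-in replacements for clusterApply. The sketch below uses a hypothetical chunking of 20 indices into 4 chunks and a trivial per-chunk function in place of the real interpolation; only the call pattern is the point.

```r
library(parallel)

cl <- makeCluster(2)                        # two workers (assumed count)
chunks <- split(1:20, rep(1:4, each = 5))   # hypothetical chunking of the input

# parLapply groups the chunks into one batch per worker, so each worker
# receives a single message and sends back a single reply.
sums_par <- parLapply(cl, chunks, function(idx) sum(idx^2))

# clusterApplyLB hands out one chunk at a time and gives the next chunk to
# whichever worker finishes first.
sums_lb <- clusterApplyLB(cl, chunks, function(idx) sum(idx^2))
stopCluster(cl)

# Same results either way; only the scheduling differs.
stopifnot(identical(unname(sums_par), unname(sums_lb)))
print(unlist(sums_par))
```

The batching in parLapply reduces the number of round trips between master and workers, which is one reason it is often the cheaper choice when the tasks take similar amounts of time.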