I am writing MATLAB code to perform a 3-dimensional integral:
function [ fint ] = int3d_ser(R0, Rf, N)
  % Numbers of integration nodes in r, theta, phi
  Nr = N;
  Nt = round(pi*N);
  Np = round(2*pi*N);
  rs = linspace(R0, Rf, Nr);
  ts = linspace(0, pi, Nt);
  ps = linspace(0, 2*pi, Np);
  dr = rs(2)-rs(1);
  dt = ts(2)-ts(1);
  dp = ps(2)-ps(1);
  % Constant integrand: 1 over the volume of the unit sphere
  C = 1/((4/3)*pi);
  fint = 0.0;
  for ir = 2:Nr
    r = rs(ir);
    r2dr = r*r*dr;
    for it = 1:Nt-1
      t = ts(it);
      sintdt = sin(t)*dt;
      for ip = 1:Np-1
        % Spherical volume element: r^2 sin(t) dr dt dp
        % (the integrand does not depend on phi, so ps(ip) is never needed)
        fint = fint + C*r2dr*sintdt*dp;
      end
    end
  end
end
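As a quick sanity check (my own note, not part of the benchmark): since the integrand is the constant 1/((4/3)*pi), integrating over the unit sphere should return a value close to 1:

fint = int3d_ser(0, 1, 100);   % Riemann sum over the unit sphere; expect ~1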
For the associated int3d_par (parfor) version, I open a MATLAB pool and just replace the for with a parfor. I get pretty decent speedup when I run it on more cores (my tests are from 2 to 8 cores).
However, when I run the same integration in batch mode with:
function [fint] = int3d_batch_cluster(R0, Rf, N, cluster, ncores)
  %%% note: This will not give back the same value as the serial or parpool
  %%% version. If this were a legit integration, I would worry more about even
  %%% dispersion of integration nodes per core, but I just want to benchmark
  %%% right now so ... meh
  Nr = N;
  Nt = round(pi*N);
  Np = round(2*pi*N);
  rs = linspace(R0, Rf, Nr);
  ts = linspace(0, pi, Nt);
  ps = linspace(0, 2*pi, Np);
  dr = rs(2)-rs(1);
  dt = ts(2)-ts(1);
  dp = ps(2)-ps(1);
  C = 1/((4/3)*pi);
  % Split the Nr radial nodes as evenly as possible across the workers
  rns = floor( Nr/ncores )*ones(ncores,1);
  for icore = 1:ncores
    if(sum(rns) ~= Nr)
      rns(icore) = rns(icore)+1;
    end
  end
  % Cumulative counts give the index of each chunk's last radial node
  RNS = zeros(ncores,1);
  RNS(1) = rns(1);
  for icore = 2:ncores
    RNS(icore) = RNS(icore-1)+rns(icore);
  end
  % Left/right endpoints of each worker's radial subinterval
  rfs = rs(RNS);
  r0s = zeros(ncores,1);
  r0s(1) = R0;                % first chunk starts at R0, not 0
  r0s(2:end) = rfs(1:end-1);
  % One task per worker, each integrating its own radial subinterval
  j = createJob(cluster);
  t = cell(ncores,1);
  for icore = 1:ncores
    t{icore} = createTask(j, @int3d_ser, 1, {r0s(icore), rfs(icore), rns(icore)});
  end
  submit(j);
  wait(j);
  % Sum the partial integrals returned by the tasks
  fints = fetchOutputs(j);
  fint = 0.0;
  for ifint = 1:length(fints)
    fint = fint + fints{ifint};
  end
end
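For example, a hypothetical invocation against a local cluster profile (the profile name and arguments here are just for illustration):

c = parcluster('local');                      % or your configured cluster profile
fint = int3d_batch_cluster(0, 1, 200, c, 4);  % 4 tasks, one per core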
I notice that it is much, much faster. Why would doing this integration in batch mode be different than doing it in parfor?
For reference, I test the code with N from small numbers like 10 and 20 (to get the constant in the polynomial approximation of runtime) to larger numbers like 1000 and 2000. This algorithm will scale cubically, since I assign the number of integration nodes in the theta and phi directions to be a constant multiple of the given N: the total node count is Nr*Nt*Np ~ N * (pi*N) * (2*pi*N) = 2*pi^2*N^3.
For 2000 nodes, the parfor version takes about 630 seconds, while the same number of nodes in batch mode takes about 19 seconds (where around 12 seconds is simply overhead communication that we also get for 10 integration nodes).
After speaking with MathWorks support, it appears I had a fundamental misunderstanding of how parfor works. I was under the impression that parfor acted like OpenMP, whereas batch mode acted like MPI, in terms of shared vs. distributed memory. It turns out that parfor actually uses distributed memory as well. When I create, say, 4 batch functions, the overhead of creating a new process happens 4 times. I thought that using a parfor would cause that overhead to happen just once, and that the parfor would then take place in the same memory space. This is not the case.
In my example code, it turns out that for each iteration of the parfor, I am actually incurring the overhead of creating a new thread. When comparing 'apples to apples', I should really be creating the same number of batch calls as I have iterations in the parfor loop. This is why the parfor version was taking so much longer: I was incurring much more overhead for multiprocessing.
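One way to make that comparison fair is to coarsen the parfor so that each worker handles one big radial chunk, mirroring the batch decomposition. A minimal sketch of the idea, using the same chunking as int3d_batch_cluster above (int3d_par_chunked is a name invented here for illustration):

function [ fint ] = int3d_par_chunked(R0, Rf, N, ncores)
  rs = linspace(R0, Rf, N);
  % Same radial decomposition as the batch version: spread the
  % remainder of N/ncores over the first few chunks
  rns = floor(N/ncores)*ones(1,ncores);
  rns(1:mod(N,ncores)) = rns(1:mod(N,ncores)) + 1;
  RNS = cumsum(rns);           % index of each chunk's last radial node
  rfs = rs(RNS);               % chunk right endpoints
  r0s = [R0, rfs(1:end-1)];    % chunk left endpoints
  fint = 0.0;
  parfor icore = 1:ncores      % one iteration per worker, not per radial node
    fint = fint + int3d_ser(r0s(icore), rfs(icore), rns(icore));
  end
end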