
Optimizing a large model - trying to exploit parallelism

I have a chunk of code I've been re-writing over the past week or so to get it running as quickly as possible.

The code models a diffracted laser beam; its essence is the convolution of a 640*640 kernel over many 2D 1280*1280 slices, each slice being a new position along the beam axis.

Stage one of optimizing was compiling my functions, and stage two was learning that Mathematica likes to operate on large lists of data - so passing it a 3D space of many layers at once, as opposed to slices one after another.

However, this ate my RAM!

Here is my current setup:

Func2[K_, ZRange_] :=
 Module[{layers = Dimensions[ZRange][[1]]},
  x = ConstantArray[Table[x, {x, -80, 80, 0.125}, {y, -80, 80, 0.125}], {layers}];
  y = ConstantArray[Table[y, {x, -80, 80, 0.125}, {y, -80, 80, 0.125}], {layers}];
  z = Table[ConstantArray[z, {1281, 1281}], {z, ZRange}];

  UTC = Func3[x, y, z];

  Abs[ListConvolve[K, #] & /@ UTC]
 ]


Func3 = Compile[{{x, _Real}, {y, _Real}, {z, _Real}},
   Module[{Sr2R2 = Sqrt[x^2 + y^2 + z^2]},
    0.5 (1. + z/Sr2R2) Exp[2 \[Pi] I (Sr2R2 - z)]/Sr2R2],
   RuntimeAttributes -> {Listable},
   CompilationTarget -> "C"
   ];


ZRangeList = {{20., 19., 18., 17., 16., 15., 14., 13., 12., 11.},
               {10., 9., 8., 7., 6., 5., 4., 3., 2., 1.}};


results = Table[Func2[kernel, ZList], {ZList, ZRangeList}];

Some explanations:

  • The work is split into two functions as I want to be able to compile as much as possible.
  • The Z values are split into a list of lists to make the functions evaluate several layers at once.

Some Questions:

  • How would you make this faster?
  • When run as is, both my cores are used, but by one Mathematica kernel. If I run it with ParallelTable, it runs multiple kernels but eats more RAM and is ultimately slower.
  • I would like to run it on as many cores as possible. I have a LightweightGrid running; how can I do this?
  • Why can't I pass a Compiled function lists of different dimensions?
Quantum_Oli asked Feb 12 '26


2 Answers

The thing that jumps out at me instantly is that

Abs[ListConvolve[K, #] & /@ UTC] could be made into ParallelMap[Abs@ListConvolve[K, #] & , UTC]
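
For reference, a minimal sketch of that substitution, with K and UTC named as in the question (the sub-kernel setup is sketched further below):

(* serial version from the question *)
Abs[ListConvolve[K, #] & /@ UTC]

(* parallel variant: each slice of UTC is convolved on a separate sub-kernel;
   K must be visible to the sub-kernels, see below *)
ParallelMap[Abs[ListConvolve[K, #]] &, UTC]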

However, I'm really surprised that ParallelTable is slower than plain Table, since that's only the case in two situations: either it's more expensive to parallelize than to perform the task, or parallelization requires too much communication between sub-kernels.

Have you distributed your definitions when you've parallelized? E.g. for the above, you'd LaunchKernels first, before you even start, and then distribute the definition of K (UTC doesn't need to be distributed, since it's not actually used in the sub-kernels; rather, its parts are). See if you can make use of Share[] as well to reduce memory load.
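
A hedged sketch of that setup (symbol names as in the question; exactly which symbols need distributing depends on what the mapped function references):

LaunchKernels[];              (* start the sub-kernels once, up front *)
DistributeDefinitions[K];     (* the convolution kernel is read inside the sub-kernels *)
result = ParallelMap[Abs[ListConvolve[K, #]] &, UTC];
Share[];                      (* share storage of identical subexpressions to trim memory use *)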

Have you thought about doing this with CUDA? Seems perfect for the simple numeric maths that you're doing inside the functions.
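
An untested sketch of that idea using CUDALink, offloading only the point-wise field evaluation; the function name, and the xf/yf/zf arguments, are illustrative rather than from the question, and ListConvolve stays on the CPU. Whether this beats the compiled CPU version would have to be measured.

Needs["CUDALink`"]

(* CUDA kernel for the point-wise field; Real_t and mint are CUDALink's
   precision-agnostic type macros *)
src = "
__global__ void field(Real_t *x, Real_t *y, Real_t *z,
                      Real_t *re, Real_t *im, mint n) {
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    if (i < n) {
        Real_t r     = sqrt(x[i]*x[i] + y[i]*y[i] + z[i]*z[i]);
        Real_t amp   = 0.5*(1.0 + z[i]/r)/r;
        Real_t phase = 6.283185307179586*(r - z[i]);   /* 2 Pi (r - z) */
        re[i] = amp*cos(phase);
        im[i] = amp*sin(phase);
    }
}";

fieldGPU = CUDAFunctionLoad[src, "field",
   {{_Real, _, "Input"}, {_Real, _, "Input"}, {_Real, _, "Input"},
    {_Real, _, "Output"}, {_Real, _, "Output"}, _Integer}, 256];

(* usage on one flattened slice: xf, yf, zf are Flatten-ed coordinate grids *)
n = Length[xf];
{re, im} = fieldGPU[xf, yf, zf, ConstantArray[0., n], ConstantArray[0., n], n];
slice = Partition[re + I im, 1281];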

Also notice that you're constantly re-creating this table: Table[x, {x, -80, 80, 0.125}, {y, -80, 80, 0.125}]. Why not make it a variable and create a ConstantArray from its value? You're wasting about 0.2 seconds on every one of those.
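
A sketch of that change against the question's Func2 (xGrid and yGrid are illustrative names): the coordinate grids are built once, and each call only wraps them in ConstantArray.

(* build the 1281 x 1281 coordinate grids once, outside the function *)
xGrid = Table[x, {x, -80, 80, 0.125}, {y, -80, 80, 0.125}];
yGrid = Table[y, {x, -80, 80, 0.125}, {y, -80, 80, 0.125}];

Func2[K_, ZRange_] :=
 Module[{layers = Length[ZRange], x, y, z, UTC},
  x = ConstantArray[xGrid, {layers}];
  y = ConstantArray[yGrid, {layers}];
  z = Table[ConstantArray[zv, {1281, 1281}], {zv, ZRange}];
  UTC = Func3[x, y, z];
  Abs[ListConvolve[K, #] & /@ UTC]]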

Lastly, a tiny little quirk: division is always a terrible thing to do when you're trying to optimize - it's time-consuming:

Module[{Sr2R2 = Sqrt[x^2 + y^2 + z^2]},
       0.5 (1. + z/Sr2R2) Exp[2 \[Pi] I (Sr2R2 - z)]/Sr2R2]

can be made a hair better as (feel free to check my math):

Module[{R2=N[x^2 + y^2 + z^2],Sr2R2 = Sqrt[R2]}, 
       (0.5 Exp[2 I \[Pi] (Sr2R2 - z)] (Sr2R2 + z))/R2]
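
Since the answer invites a check of the math, here is a quick numerical spot check of the two forms (the test point is arbitrary):

Block[{x = 1.3, y = -0.7, z = 2.1},
 Chop[Module[{Sr2R2 = Sqrt[x^2 + y^2 + z^2]},
    0.5 (1. + z/Sr2R2) Exp[2 \[Pi] I (Sr2R2 - z)]/Sr2R2] -
   Module[{R2 = N[x^2 + y^2 + z^2], Sr2R2 = Sqrt[R2]},
    (0.5 Exp[2 I \[Pi] (Sr2R2 - z)] (Sr2R2 + z))/R2]]]
(* 0, i.e. the two expressions agree to machine precision *)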
Gregory Klopper answered Feb 15 '26


Neither parallelization nor even C-compilation (using gcc 4.7 from equation.com, augmented by VC++ Express, on 64-bit Windows) improves the timings.

Running this code needs about 6.5 seconds:

$start = AbsoluteTime[];
Func2[K_, ZRange_] := 
 Module[{layers = Dimensions[ZRange][[1]], x, y, z, UTC, tx, ty, t1},
  tx = Table[x, {x, -80, 80, 0.125}, {y, -80, 80, 0.125}];
  ty = Table[y, {x, -80, 80, 0.125}, {y, -80, 80, 0.125}];
  x = ConstantArray[tx, {layers}];
  y = ConstantArray[ty, {layers}];
  z = Table[ConstantArray[z, {1281, 1281}], {z, ZRange}];
  t1 = AbsoluteTime[];
  UTC = Func3[x, y, z];
  Print["Func3 time = ", AbsoluteTime[] - t1];
  Abs[ListConvolve[K, #] & /@ UTC]]
Func3 = Compile[{{x, _Real, 3}, {y, _Real, 3}, {z, _Real, 3}}, 
   Module[{Sr2R2 = Sqrt[x^2 + y^2 + z^2]}, 
    0.5 (1. + z/Sr2R2) Exp[2 \[Pi] I (Sr2R2 - z)]/Sr2R2]];
ZRangeList = {{20., 19., 18., 17., 16., 15., 14., 13., 12., 
    11.}, {10., 9., 8., 7., 6., 5., 4., 3., 2., 1.}};
SeedRandom[1]; kernel = RandomReal[{-1, 1}, {640, 640}];
results1 = Table[Func2[kernel, ZList], {ZList, ZRangeList}];
AbsoluteTime[] - $start

and compiling everything into one function is slower (8.1 sec):

$start = AbsoluteTime[]; 
CFunc2 = Compile[{{kern, _Real, 2}, {ZRange, _Real, 1}}, 
    Module[{layers = Length[ZRange], x, y, z, UTC, ty, Sr2R2}, 
     ty = Table[y, {x, -80, 80, 0.125}, {y, -80, 80, 0.125}]; 
      x = Table[x, {layers}, {x, -80, 80, 0.125}, {y, -80, 80, 0.125}]; 
      y = Table[y, {layers}, {x, -80, 80, 0.125}, {y, -80, 80, 0.125}]; 
      z = Table[ConstantArray[z, {1281, 1281}], {z, ZRange}]; 
      Sr2R2 = Sqrt[x^2 + y^2 + z^2]; UTC = 0.5*(1. + z/Sr2R2)*
        (Exp[2*Pi*I*(Sr2R2 - z)]/Sr2R2); 
      Abs[(ListConvolve[kern, #1] & ) /@ UTC]]]; 
ZRangeList = {{20., 19., 18., 17., 16., 15., 14., 13., 12., 11.}, 
    {10., 9., 8., 7., 6., 5., 4., 3., 2., 1.}}; 
SeedRandom[1]; kernel = RandomReal[{-1, 1}, {640, 640}]; 
results = Table[CFunc2[kernel, ZList], {ZList, ZRangeList}]; 
AbsoluteTime[] - $start

It is usually not so easy to figure out when ParallelTable and friends really help. It just depends on the problem, its size, the Mathematica version, etc.
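
One straightforward way to check this on a given machine, reusing the definitions from the timing code above, is to time both variants directly (a sketch; DistributeDefinitions is explicit here, though recent versions distribute automatically):

LaunchKernels[];
DistributeDefinitions[Func2, Func3, kernel];
serial   = First@AbsoluteTiming[Table[Func2[kernel, ZList], {ZList, ZRangeList}]];
parallel = First@AbsoluteTiming[ParallelTable[Func2[kernel, ZList], {ZList, ZRangeList}]];
{serial, parallel}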

Rolf Mertig answered Feb 15 '26


