I have a chunk of code I've been re-writing over the past week or so to get it running as quickly as possible.
The code is modeling a diffracted laser beam and its essence is a convolution of a 640*640 kernel over many 2D 1280*1280 slices - each slice being a new position along the beam axis.
Stage one of optimizing was Compiling my functions and stage two was learning that Mathematica likes to operate with large lists of data - so passing it a 3D space of many layers at once as opposed to slices one after another.
However this ate my RAM!
Here is my current set up:
Func2[K_ , ZRange_] :=
Module[{layers = Dimensions[ZRange][[1]]},
x = ConstantArray[Table[x, {x, -80, 80, 0.125}, {y, -80, 80, 0.125}], {layers}];
y = ConstantArray[Table[y, {x, -80, 80, 0.125}, {y, -80, 80, 0.125}], {layers}];
z = Table[ConstantArray[z, {1281, 1281}], {z, ZRange}];
UTC = Func3[x, y, z];
Abs[ListConvolve[K, #] & /@ UTC]
]
Func3 = Compile[{{x, _Real}, {y, _Real}, {z, _Real}},
Module[{Sr2R2 = Sqrt[x^2 + y^2 + z^2]},
0.5 (1. + z/Sr2R2) Exp[2 \[Pi] I (Sr2R2 - z)]/Sr2R2],
RuntimeAttributes -> {Listable},
CompilationTarget -> "C"
];
ZRangeList = {{20., 19., 18., 17., 16., 15., 14., 13., 12., 11.},
{10., 9., 8., 7., 6., 5., 4., 3., 2., 1.}};
results = Table[Func2[kernel, ZList], {ZList, ZRangeList}];
Some explanations:
Some Questions:
Thing that jumps out instantly at me is
Abs[ListConvolve[K, #] & /@ UTC] could be made into ParallelMap[Abs@ListConvolve[K, #] & , UTC]
However, I'm really surprised that ParallelTable is slower than plain table since that's only the case in 2 situations: it's more expensive to parallelize than to performa a task, or parallelization requires too much communication between sub-kernels.
Have you distributed your definitions when you've parallelized? E.g. for above, you'd LaunchKernels first, before you even start, and then distribute definitions of K (UTC doesn't nneed to be distributed, since it's not actually used in sub-kernels, rather its parts are. See if you can make use of Share[] as well to reduce memory load.
Have you thought about doing this with CUDA? Seems perfect for the simple numeric maths that you're doing inside the functions.
Also notice, you're constantly re-creating this table: Table[x, {x, -80, 80, 0.125}, {y, -80, 80, 0.125}], why not make it a variable, and create a ConstantArray of the value of that variable? You're wasting about 0.2 seconds on every one of those.
Lastly, a tiny little quirk: division is always a terrible thing to do when you're trying to optimize - it's time consuming:
Module[{Sr2R2 = Sqrt[x^2 + y^2 + z^2]},
0.5 (1. + z/Sr2R2) Exp[2 \[Pi] I (Sr2R2 - z)]/Sr2R2]
can be made a hair better as (feel free to check my math):
Module[{R2=N[x^2 + y^2 + z^2],Sr2R2 = Sqrt[R2]},
(0.5 Exp[2 I \[Pi] (Sr2R2 - z)] (Sr2R2 + z))/R2]
Neither parallelization nor even the C-compilation (using gcc 4.7 from equation.com augmented by VC++Express on Windows 64 bit) does improve timings.
Running this code needs about 6.5 seconds:
$start = AbsoluteTime[];
Func2[K_, ZRange_] :=
Module[{layers = Dimensions[ZRange][[1]], x, y, z, UTC, tx, ty, t1},
tx = Table[x, {x, -80, 80, 0.125}, {y, -80, 80, 0.125}];
ty = Table[y, {x, -80, 80, 0.125}, {y, -80, 80, 0.125}];
x = ConstantArray[tx, {layers}];
y = ConstantArray[ty, {layers}];
z = Table[ConstantArray[z, {1281, 1281}], {z, ZRange}];
t1 = AbsoluteTime[];
UTC = Func3[x, y, z];
Print["Func3 time = ", AbsoluteTime[] - t1];
Abs[ListConvolve[K, #] & /@ UTC]]
Func3 = Compile[{{x, _Real, 3}, {y, _Real, 3}, {z, _Real, 3}},
Module[{Sr2R2 = Sqrt[x^2 + y^2 + z^2]},
0.5 (1. + z/Sr2R2) Exp[2 \[Pi] I (Sr2R2 - z)]/Sr2R2]];
ZRangeList = {{20., 19., 18., 17., 16., 15., 14., 13., 12.,
11.}, {10., 9., 8., 7., 6., 5., 4., 3., 2., 1.}};
SeedRandom[1]; kernel = RandomReal[{-1, 1}, {640, 640}];
results1 = Table[Func2[kernel, ZList], {ZList, ZRangeList}];
AbsoluteTime[] - $start
and compiling everything into one function is slower (8.1 sec):
$start = AbsoluteTime[];
CFunc2 = Compile[{{kern, _Real, 2}, {ZRange, _Real, 1}},
Module[{layers = Length[ZRange], x, y, z, UTC, ty, Sr2R2},
ty = Table[y, {x, -80, 80, 0.125}, {y, -80, 80, 0.125}];
x = Table[x, {layers}, {x, -80, 80, 0.125}, {y, -80, 80, 0.125}];
y = Table[y, {layers}, {x, -80, 80, 0.125}, {y, -80, 80, 0.125}];
z = Table[ConstantArray[z, {1281, 1281}], {z, ZRange}];
Sr2R2 = Sqrt[x^2 + y^2 + z^2]; UTC = 0.5*(1. + z/Sr2R2)*
(Exp[2*Pi*I*(Sr2R2 - z)]/Sr2R2);
Abs[(ListConvolve[kern, #1] & ) /@ UTC]]];
ZRangeList = {{20., 19., 18., 17., 16., 15., 14., 13., 12., 11.},
{10., 9., 8., 7., 6., 5., 4., 3., 2., 1.}};
SeedRandom[1]; kernel = RandomReal[{-1, 1}, {640, 640}];
results = Table[CFunc2[kernel, ZList], {ZList, ZRangeList}];
AbsoluteTime[] - $start
It is usually not so easy to figure out when ParallelTable and friends really help. Just depends on the problem, size, Mathematica verison, etc.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With