I find the array library Repa for Haskell very interesting, and wanted to make a simple program, to try to understand how to use it. I also made a simple implementation using lists, which proved to be much faster. My main question is how I could improve the Repa code below to make it the most efficient (and hopefully also very readable). I am quite new using Haskell, and I couldn't find any easily understandable tutorial on Repa [edit there is one at the Haskell Wiki, that I somehow forgot when I wrote this], so don't assume I know anything. :) For example, I'm not sure when to use force or deepSeqArray.
The program is used to approximately calculate the volume of a sphere in the following way:
Two versions are shown below, one using lists and one using repa. I know the code is inefficient, especially for this use case, but the idea is to make it more complicated later on.
For the values below, and compiling with "ghc -Odph -fllvm -fforce-recomp -rtsopts -threaded", the list version takes 1.4 s, while the repa version takes 12 s with +RTS -N1 and 10 s with +RTS -N2, though no sparks are converted (I have a dual-core Intel machine (Core 2 Duo E7400 @ 2.8 GHz) running Windows 7 64, GHC 7.0.2 and llvm 2.8). (Comment out the correct line in main below to just run one of the versions.)
Thank you for any help!
import Data.Array.Repa as R
import qualified Data.Vector.Unboxed as V
import Prelude as P
-- Calculate the volume of a sphere by putting it in a bath of coordinates. Generate coordinates (x,y,z) in a cuboid. Then, for each coordinate, check if it is inside the sphere. Sum those coordinates and multiply by the coordinate grid step size to find an approximate volume.
particles = [(0,0,0)] -- used for the list alternative --[(0,0,0),(0,2,0)]
particles_repa = [0,0,0::Double] -- used for the repa alternative, can currently just be one coordinate
-- Radius of the particle
a = 4
-- Generate the coordinates. Could this be done more efficiently, and at the same time simple? In Matlab I would use ndgrid.
step = 0.1 --0.05
xrange = [-10,-10+step..10] :: [Double]
yrange = [-10,-10+step..10]
zrange = [-10,-10+step..10]
-- All coordinates as triples. These are used directly in the list version below.
coords = [(x,y,z) | x <- xrange, y <- yrange, z <- zrange]
---- List code ----
volumeIndividuals = fromIntegral (length particles) * 4*pi*a**3/3
volumeInside = step**3 * fromIntegral (numberInsideParticles particles coords)
numberInsideParticles particles coords = length $ filter (==True) $ P.map (insideParticles particles) coords
insideParticles particles coord = any (==True) $ P.map (insideParticle coord) particles
insideParticle (xc,yc,zc) (xp,yp,zp) = ((xc-xp)^2+(yc-yp)^2+(zc-zp)^2) < a**2
---- End list code ----
---- Repa code ----
-- Put the coordinates in a Nx3 array.
xcoords = P.map (\(x,_,_) -> x) coords
ycoords = P.map (\(_,y,_) -> y) coords
zcoords = P.map (\(_,_,z) -> z) coords
-- Total number of coordinates
num_coords = (length xcoords) ::Int
xcoords_r = fromList (Z :. num_coords :. (1::Int)) xcoords
ycoords_r = fromList (Z :. num_coords :. (1::Int)) ycoords
zcoords_r = fromList (Z :. num_coords :. (1::Int)) zcoords
rcoords = xcoords_r R.++ ycoords_r R.++ zcoords_r
-- Put the particle coordinates in an array, then extend (replicate) this array so that its size becomes the same as that of rcoords
particle = fromList (Z :. (1::Int) :. (3::Int)) particles_repa
particle_slice = slice particle (Z :. (0::Int) :. All)
particle_extended = extend (Z :. num_coords :. All) particle_slice
-- Calculate the squared difference between the (x,y,z) coordinates of the particle and the coordinates of the cuboid.
squared_diff = deepSeqArrays [rcoords,particle_extended] ((force2 rcoords) -^ (force2 particle_extended)) **^ 2
(**^) arr pow = R.map (**pow) arr
xslice = slice squared_diff (Z :. All :. (0::Int))
yslice = slice squared_diff (Z :. All :. (1::Int))
zslice = slice squared_diff (Z :. All :. (2::Int))
-- Calculate the distance between each coordinate and the particle center
sum_squared_diff = [xslice,yslice,zslice] `deepSeqArrays` xslice +^ yslice +^ zslice
-- Do the rest using vector, since I didn't get the repa variant working.
ssd_vec = toVector sum_squared_diff
-- Determine the number of the coordinates that are within the particle (instead of taking the square root to get the distances above, I compare to the square of the radius here, to improve performance)
total_within = fromIntegral (V.length $ V.filter (<a**2) ssd_vec)
--total_within = foldAll (\x acc -> if x < a**2 then acc+1 else acc) 0 sum_squared_diff
-- Finally, calculate an approximation of the volume of the sphere by taking the volume of the cubes with side step, multiplied with the number of coordinates within the sphere.
volumeInside_repa = step**3 * total_within
-- Helper function that shows the size of a 2-D array.
rsize = reverse . listOfShape . (extent :: Array DIM2 Double -> DIM2)
---- End repa code ----
-- Comment out the list or the repa version if you want to time the calculations separately.
main = do
putStrLn $ "Step = " P.++ show step
putStrLn $ "Volume of individual particles = " P.++ show volumeIndividuals
putStrLn $ "Volume of cubes inside particles (list) = " P.++ show volumeInside
putStrLn $ "Volume of cubes inside particles (repa) = " P.++ show volumeInside_repa
Edit: Some background that explains why I have written the code as it is above:
I mostly write code in Matlab, and my experience of performance improvement comes mostly from that area. In Matlab, you usually want to make your calculations using functions operating on matrices directly, to improve performance. My implementation of the problem above, in Matlab R2010b, takes 0.9 seconds using the matrix version shown below, and 15 seconds using nested loops. Although I know Haskell is very different from Matlab, my hope was that going from using lists to using Repa arrays in Haskell would improve the performance of the code. The conversions from lists->Repa arrays->vectors are there because I'm not skilled enough to replace them with something better. This is why I ask for input. :) The timing numbers above is thus subjective, since it may measure my performance more than that of the abilities of the languages, but it is a valid metric for me right now, since what decides what I will use depends on if I can make it work or not.
tl;dr: I understand that my Repa code above may be stupid or pathological, but it's the best I can do right now. I would love to be able to write better Haskell code, and I hope that you can help me in that direction (dons already did). :)
function archimedes_simple()
particles = [0 0 0]';
a = 4;
step = 0.1;
xrange = [-10:step:10];
yrange = [-10:step:10];
zrange = [-10:step:10];
[X,Y,Z] = ndgrid(xrange,yrange,zrange);
dists2 = bsxfun(@minus,X,particles(1)).^2+ ...
bsxfun(@minus,Y,particles(2)).^2+ ...
bsxfun(@minus,Z,particles(3)).^2;
inside = dists2 < a^2;
num_inside = sum(inside(:));
disp('');
disp(['Step = ' num2str(step)]);
disp(['Volume of individual particles = ' num2str(size(particles,2)*4*pi*a^3/3)]);
disp(['Volume of cubes inside particles = ' num2str(step^3*num_inside)]);
end
Edit 2: New, faster and simpler version of the Repa code
I have now read up a bit more on Repa, and thought a bit. Below is a new Repa version. In this case, I create the x, y, and z coordinates as 3-D arrays, using the Repa extend function, from a list of values (similar to how ndgrid works in Matlab). I then map over these arrays to calculate the distance to the spherical particle. Finally, I fold over the resulting 3-D distance array, count how many coordinates are within the sphere, and then multiply it by a constant factor to get the approximate volume. My implementation of the algorithm is now much more similar to the Matlab version above, and there are no longer any conversion to vector.
The new version runs in about 5 seconds on my computer, a considerable improvement from above. The timing is the same if I use "threaded" while compiling, combined with "+RTS -N2" or not, but the threaded version does max out both cores of my computer. I did, however, see a few drops of the "-N2" run to 3.1 seconds, but couldn't reproduce them later. Maybe it is very sensitive to other processes running at the same time? I have shut most programs on my computer when benchmarking, but there are still some programs running, such as background processes.
If we use "-N2" and add the runtime switch to turn off parallel GC (-qg), the time consistently goes down to ~4.1 seconds, and using -qa to "use the OS to set thread affinity (experimental)", the time was shaved down to ~3.5 seconds. Looking at the output from running the program with "+RTS -s", much less GC is performed using -qg.
This afternoon I will see if I can run the code on an 8-core computer, just for fun. :)
import Data.Array.Repa as R
import Prelude as P
import qualified Data.List as L
-- Calculate the volume of a spherical particle by putting it in a bath of coordinates. Generate coordinates (x,y,z) in a cuboid. Then, for each coordinate, check if it is inside the sphere. Sum those coordinates and multiply by the coordinate grid step size to find an approximate volume.
particles :: [(Double,Double,Double)]
particles = [(0,0,0)]
-- Radius of the spherical particle
a = 4
volume_individuals = fromIntegral (length particles) * 4*pi*a^3/3
-- Generate the coordinates.
step = 0.1
coords_list = [-10,-10+step..10] :: [Double]
num_coords = (length coords_list) :: Int
coords :: Array DIM1 Double
coords = fromList (Z :. (num_coords ::Int)) coords_list
coords_slice :: Array DIM1 Double
coords_slice = slice coords (Z :. All)
-- x, y and z are 3-D arrays, where the same index into each array can be used to find a single coordinate, e.g. (x(i,j,k),y(i,j,k),z(i,j,k)).
x,y,z :: Array DIM3 Double
x = extend (Z :. All :. num_coords :. num_coords) coords_slice
y = extend (Z :. num_coords :. All :. num_coords) coords_slice
z = extend (Z :. num_coords :. num_coords :. All) coords_slice
-- Calculate the squared distance from each coordinate to the center of the spherical particle.
dist2 :: (Double, Double, Double) -> Array DIM3 Double
dist2 particle = ((R.map (squared_diff xp) x) + (R.map (squared_diff yp) y) + (R.map ( squared_diff zp) z))
where
(xp,yp,zp) = particle
squared_diff xi xa = (xa-xi)^2
-- Count how many of the coordinates are within the spherical particle.
num_inside_particle :: (Double,Double,Double) -> Double
num_inside_particle particle = foldAll (\acc x -> if x<a^2 then acc+1 else acc) 0 (force $ dist2 particle)
-- Calculate the approximate volume covered by the spherical particle.
volume_inside :: [Double]
volume_inside = P.map ((*step^3) . num_inside_particle) particles
main = do
putStrLn $ "Step = " P.++ show step
putStrLn $ "Volume of individual particles = " P.++ show volume_individuals
putStrLn $ "Volume of cubes inside each particle (repa) = " P.++ (P.concat . ( L.intersperse ", ") . P.map show) volume_inside
-- As an alternative, y and z could be generated from x, but this was slightly slower in my tests (~0.4 s).
--y = permute_dims_3D x
--z = permute_dims_3D y
-- Permute the dimensions in a 3-D array, (forward, cyclically)
permute_dims_3D a = backpermute (swap e) swap a
where
e = extent a
swap (Z :. i:. j :. k) = Z :. k :. i :. j
Space profiling for the new code
The same types of profiles as Don Stewart made below, but for the new Repa code.
Code Review Notes
rsize
isn't used (looks expensive!)(**)
could be replaced by the cheaper (^)
on Int
.any (==True)
is the same as or
Time profiling
COST CENTRE MODULE %time %alloc
squared_diff Main 25.0 27.3
insideParticle Main 13.8 15.3
sum_squared_diff Main 9.8 5.6
rcoords Main 7.4 5.6
particle_extended Main 6.8 9.0
particle_slice Main 5.0 7.6
insideParticles Main 5.0 4.4
yslice Main 3.6 3.0
xslice Main 3.0 3.0
ssd_vec Main 2.8 2.1
**^ Main 2.6 1.4
shows that, your function squared_diff
is a bit suspicious:
squared_diff :: Array DIM2 Double
squared_diff = deepSeqArrays [rcoords,particle_extended]
((force2 rcoords) -^ (force2 particle_extended)) **^ 2
though I don't see any obvious fix.
Space profiling
Nothing too amazing in the space profile: you clearly see the list phase, then the vector phase. The list phase allocates a lot, which gets reclaimed.
Breaking down the heap by type, we see initially a lot of lists and tuples being allocated (on demand), then a big chunk of arrays are allocated and held:
Again, kinda what we expected to see... the array stuff isn't allocating especially more than the list code (in fact, a bit less overall), but it is just taking a lot longer to run.
Checking for space leaks with retainer profiling:
There's a few interesting things there, but nothing startling. zcoords
gets retained for the length of the list program execution, then some arrays (SYSTEM) are being allocated for the repa run.
Inspecting the Core
So at this point I'm firstly assuming that you really did implement the same algorithms in lists and arrays (i.e. no extra work is being done in the array case), and there's no obvious space leak. So my suspicion is badly-optimized repa code. Let's look at the core (with ghc-core.
Inlining all the CAFs
I added inline pragmas to all the top level array definitions, in a hope to remove some of the CAfs, and get GHC to optimize the array code a bit harder. This really made GHC struggle to compile the module (allocating up to 4.3G and 10 minutes while working on it). This is a clue to me that GHC wasn't able to optimize this program well before, since there's new stuff for it to do when I increase the thresholds.
Actions
I changed the code to force rcoords
and particle_extended
, and disovered we were losing the lion's share of time within them directly:
COST CENTRE MODULE %time %alloc
rcoords Main 32.6 34.4
particle_extended Main 21.5 27.2
**^ Main 9.8 12.7
The biggest single improvement to this code would clearly be to generate those two constant inputs in a better fashion.
Note that this is basically a lazy, streaming algorithm, and where you're losing time is the sunk cost of allocating at least two 24361803-element arrays all in one go, and then probably allocating at least once or twice more or giving up sharing. The very best case for this code, I think, with a very good optimizer and a zillion rewrite rules, will be to roughly match the list version (which can also parallelize very easily).
I think dons is right that Ben & co. will be interested in this benchmark, but my overwhelming suspicion is that this is not a good use case for a strict array library, and my suspicion is that matlab is hiding some clever optimizations behind its ngrid
function (optimizations, I'll grant, which it might be useful to port to repa).]
Edit:
Here's a quick and dirty way to parallelize the list code. Import Control.Parallel.Strategies
and then write numberInsideParticles
as:
numberInsideParticles particles coords = length $ filter id $
withStrategy (parListChunk 2000 rseq) $ P.map (insideParticles particles) coords
This shows good speedup as we scale up cores (12s at one core to 3.7s at 8), but the overhead of spark creation means that even a 8 cores we only match the single core non-parallel version. I tried a few alternate strategies and got similar results. Again, I'm not sure how much better we can possibly do than a single-threaded list version here. Since the computations on each individual particle are so cheap, we're mainly stressing allocation, not computation. The big win on something like this I imagine would be vectorized computation more than anything else, and as far as I know that pretty much requires hand-coding.
Also note that the parallel version spends roughly 70% of its time in GC, while the one-core version spend 1% of its time there (i.e. the allocation is, to the extent possible, is effectively fused away.).
I've added some advice about how to optimise Repa programs to the Haskell wiki: http://www.haskell.org/haskellwiki/Numeric_Haskell:_A_Repa_Tutorial#Optimising_Repa_programs
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With