I wanted to write an efficient implementation of the Floyd-Warshall all pairs shortest path algorithm in Haskell using Vector
s to hopefully get good performance.
The implementation is quite straight-forward, but instead of using a 3-dimensional |V|×|V|×|V| matrix, a 2-dimensional vector is used, since we only ever read the previous k
value.
Thus, the algorithm is really just a series of steps where a 2D vector is passed in, and a new 2D vector is generated. The final 2D vector contains the shortest paths between all nodes (i,j).
My intuition told me that it would be important to make sure that the previous 2D vector was evaluated before each step, so I used BangPatterns
on the prev
argument to the fw
function and the strict foldl'
:
{-# Language BangPatterns #-}
import Control.DeepSeq
import Control.Monad (forM_)
import Data.List (foldl')
import qualified Data.Map.Strict as M
import Data.Vector (Vector, (!), (//))
import qualified Data.Vector as V
import qualified Data.Vector.Mutable as V hiding (length, replicate, take)
type Graph = Vector (M.Map Int Double)
type TwoDVector = Vector (Vector Double)
infinity :: Double
infinity = 1/0
-- calculate shortest path between all pairs in the given graph, if there are
-- negative cycles, return Nothing
allPairsShortestPaths :: Graph -> Int -> Maybe TwoDVector
allPairsShortestPaths g v =
let initial = fw g v V.empty 0
results = foldl' (fw g v) initial [1..v]
in if negCycle results
then Nothing
else Just results
where -- check for negative elements along the diagonal
negCycle a = any not $ map (\i -> a ! i ! i >= 0) [0..(V.length a-1)]
-- one step of the Floyd-Warshall algorithm
fw :: Graph -> Int -> TwoDVector -> Int -> TwoDVector
fw g v !prev k = V.create $ do -- ← bang
curr <- V.new v
forM_ [0..(v-1)] $ \i ->
V.write curr i $ V.create $ do
ivec <- V.new v
forM_ [0..(v-1)] $ \j -> do
let d = distance g prev i j k
V.write ivec j d
return ivec
return curr
distance :: Graph -> TwoDVector -> Int -> Int -> Int -> Double
distance g _ i j 0 -- base case; 0 if same vertex, edge weight if neighbours
| i == j = 0.0
| otherwise = M.findWithDefault infinity j (g ! i)
distance _ a i j k = let c1 = a ! i ! j
c2 = (a ! i ! (k-1))+(a ! (k-1) ! j)
in min c1 c2
However, when running this program with a 1000-node graph with 47978 edges, things does not look good at all. The memory usage is very high and the program takes way too long to run. The program was compiled with ghc -O2
.
I rebuilt the program for profiling, and limited the number of iterations to 50:
results = foldl' (fw g v) initial [1..50]
I then ran the program with +RTS -p -hc
and +RTS -p -hd
:
This is... interesting, but I guess it's showing that it's accumulating tonnes of thunks. Not good.
Ok, so after a few shots in the dark, I added a deepseq
in fw
to make sure prev
really is evaluted:
let d = prev `deepseq` distance g prev i j k
Now things look better, and I can actually run the program to completion with constant memory usage. It's obvious that the bang on the prev
argument was not enough.
For comparison with the previous graphs, here is the memory usage for 50 iterations after adding the deepseq
:
Ok, so things are better, but I still have some questions:
deepseq
is a bit ugly?Vector
s here idiomatic/correct? I'm building a completely new vector for every iteration and hoping that the garbage collector will delete the old Vector
s.For references, here is graph.txt
: http://sebsauvage.net/paste/?45147f7caf8c5f29#7tiCiPovPHWRm1XNvrSb/zNl3ujF3xB3yehrxhEdVWw=
Here is main
:
main = do
ls <- fmap lines $ readFile "graph.txt"
let numVerts = head . map read . words . head $ ls
let edges = map (map read . words) (tail ls)
let g = V.create $ do
g' <- V.new numVerts
forM_ [0..(numVerts-1)] (\idx -> V.write g' idx M.empty)
forM_ edges $ \[f,t,w] -> do
-- subtract one from vertex IDs so we can index directly
curr <- V.read g' (f-1)
V.write g' (f-1) $ M.insert (t-1) (fromIntegral w) curr
return g'
let a = allPairsShortestPaths g numVerts
case a of
Nothing -> putStrLn "Negative cycle detected."
Just a' -> do
putStrLn $ "The shortest, shortest path has length "
++ show ((V.minimum . V.map V.minimum) a')
Note however that Floyd-Warshall does very few operations in the inner-loop so in practice Floyd-Warshall will likely run faster than Dijkstra for All-Pairs Shortest Path. Note that E < V^2 is true, since complete graph has (V*V-1)/2 edges (or twice that if directed).
The overall time complexity of the Floyd Warshall algorithm is O ( n 3 ) O(n^{3}) O(n3) where n denotes the number of nodes in the graph.
The credit of Floyd-Warshall Algorithm goes to Robert Floyd, Bernard Roy and Stephen Warshall. Limitations: The graph should not contain negative cycles. The graph can have positive and negative weight edges.
While Floyd-Warshall is efficient for dense graphs, if the graph is sparse then an alternative all pairs shortest path strategy known as Johnson's algorithm can be used.
First, some general code cleanup:
In your fw
function, you explicitly allocate and fill mutable vectors. However, there is a premade function for this exact purpose, namely generate
. fw
can therefore be rewritten as
V.generate v (\i -> V.generate v (\j -> distance g prev i j k))
Similarly, the graph generation code can be replaced with replicate
and accum
:
let parsedEdges = map (\[f,t,w] -> (f - 1, (t - 1, fromIntegral w))) edges
let g = V.accum (flip (uncurry M.insert)) (V.replicate numVerts M.empty) parsedEdges
Note that this totally removes all need for mutation, without losing any performance.
Now, to the actual questions:
In my experience, deepseq
is very useful, but only as quick fix to space leaks like this one. The fundamental problem is not that you need to force the results after you've produced them. Instead, the use of deepseq
implies that you should have been building the structure more strictly in the first place. In fact, if you add a bang pattern in your vector creation code like so:
let !d = distance g prev i j k
Then the problem is fixed without deepseq
. Note that this doesn't work with the generate
code, because, for some reason (I might create a feature request for this), vector
does not provide strict functions for boxed vectors. However, when I get to unboxed vectors in answer to question 3, which are strict, both approaches work without strictness annotations.
As far as I know, the pattern of repeatedly generating new vectors is idiomatic. The only thing not idiomatic is the use of mutability - except when they are strictly necessary, mutable vectors are generally discouraged.
There are a couple of things to do:
Most simply, you can replace Map Int
with IntMap
. As that isn't really the slow point of the function, this doesn't matter too much, but IntMap
can be much faster for heavy workloads.
You can switch to using unboxed vectors. Although the outer vector has to remain boxed, as vectors of vectors can't be unboxed, the inner vector can be. This also solves your strictness problem - because unboxed vectors are strict in their elements, you don't get a space leak. Note that on my machine, this improves the performance from 4.1 seconds to 1.3 seconds, so the unboxing is very helpful.
You can flatten the vector into a single one and use multiplication and division to switch between two dimensional indicies and one dimentional indicies. I don't recommend this, as it is a bit involved, quite ugly, and, due to the division, actually slows down the code on my machine.
You can use repa
. This has the huge advantage of automatically parallelizing your code. Note that, since repa
flattens its arrays and apparently doesn't properly get rid of the divisions needed to fill nicely (it's possible to do with nested loops, but I think it uses a single loop and a division), it has the same performance penalty as I mentioned above, bringing the runtime from 1.3 seconds to 1.8. However, if you enable parallelism and use a multicore machine, you start seeing some benifits. Unfortunately, you current test case is too tiny to see much benifit, so, on my 6 core machine, I see it drop back down to 1.2 seconds. If I up the size back to [1..v]
instead of [1..50]
, the parallelism brings it from 32 seconds to 13. Presumably, if you give this program a larger input, you might see more benifit.
If you're interested, I've posted my repa
-ified version here.
EDIT: Use -fllvm
. Testing on my computer, using repa
, I get 14.7 seconds without parallelism, which is almost as good as without -fllvm
and with parallelism. In general, LLVM can just handle array based code like this very well.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With