Optimising manipulation of large vectors

Tags:

This is a follow up to my previous question about processing a Vector representation of a 5.1m edge directed graph. I am trying to implement Kosaraju's graph algorithm and thus need to rearrange my Vector in the order of the finishing times of a depth first search (DFS) on the edges reversed. I have code that runs on small data sets but that fails to return in 10 minutes on the full data set. (I can't exclude that a loop arises from the big graph, but there are no signs of that on my test data.)

DFS needs to avoid revisiting nodes, so I need some sort of 'state' for the search (currently a tuple, should I use a State Monad?). The first search should return a reordered Vector, but I am keeping things simple at present by returning a list of the reordered Node indexes so that I can process the Vector in one go subsequently.

I presume the issue lies in dfsInner. The code below 'remembers' the nodes visited updating the explored field of each node (third guard). Although I tried to make it tail recursive, the code seems to grow memory use fairly fast. Do I need to enforce some strictness and if so, how? (I have another version that I use on a single search search, which checks for previous visits by looking at the start nodes of the unexplored edges on the stack and the list of nodes that have been completed. This does not grow so quickly, but does not return for any well connected node.)

However, it could also be the foldr', but how can I detect that?

This is supposedly Coursera homework, but I'm no longer sure I can tick the honour code button! Learning is more important though, so I don't really want a copy/paste answer. What I have is not very elegant - it has an imperative feel to it too, which is driven by the issue with keeping some sort of state - see third guard. I'd welcome comments on design patterns.

Click to copy

type NodeName = Int
type Edges    = [NodeName]
type Explored = Bool
type Stack    = [(Int, Int)]

data Node  = Node NodeName Explored Edges Edges deriving (Eq, Show)
type Graph = Vector Node

main = do
    edges <- V.fromList `fmap` getEdges "SCC.txt"
    let 
        maxIndex = fst $ V.last edges
        gr = createGraph maxIndex edges
        res = dfsOuter gr
    --return gr
    putStrLn $ show res

dfsOuter gr = 
    let tmp = V.foldr' callInner (gr,[]) gr
    in snd tmp

callInner :: Node -> (Graph, Stack) -> (Graph, Stack)
callInner (Node idx _ fwd bwd) (gr,acc) = 
    let (Node _ explored _ _) = gr V.! idx
    in case explored of
        True  -> (gr, acc)
        False ->
            let
                initialStack = map (\l -> (idx, l)) bwd
                gr' = gr V.// [(idx, Node idx True fwd bwd)]
                (gr'', newScc) = dfsInner idx initialStack (length acc) (gr', [])
            in (gr'', newScc++acc)

dfsInner :: NodeName -> Stack -> Int -> (Graph, [(Int, Int)]) -> (Graph, [(Int, Int)])
dfsInner start [] finishCounter (gr, acc) = (gr, (start, finishCounter):acc)
dfsInner start stack finishCounter (gr, acc)
    | nextStart /= start =                      -- no more places to go from this node
        dfsInner nextStart stack (finishCounter + 1) $ (gr, (start, finishCounter):acc)
    | nextExplored = 
-- nextExplored || any (\(y,_) -> y == stack0Head) stack || any (\(x,_) -> x == stack0Head) acc =
        dfsInner start (tail stack) finishCounter (gr, acc)
    | otherwise =
        dfsInner nextEnd (add2Stack++stack) finishCounter (gr V.// [(nextEnd, Node idx True nextLHS nextRHS)], acc)
--      dfsInner gr stack0Head (add2Stack++stack) finishCounter acc

    where
        (nextStart, nextEnd) = head stack
        (Node idx nextExplored nextLHS nextRHS) = gr V.! nextEnd
        add2Stack = map (\l -> (nextEnd, l)) nextRHS

347

asked Jun 21 '14 18:06

Simon H

1 Answers

In a nutshell:

Know the time complexities.

There are a lot of fine points to optimization, a large subset of which being not very important in everyday programming, but fail to know the asymptotic complexities and programs will often just not work at all.

Haskell libraries usually document the complexities, especially when it's not obvious or not effective (linear of worse). In particular, all the complexities relevant to this question can be found in Data.List and Data.Vector.

The performance is killed by V.// here. Vectors are boxed or unboxed immutable contiguous arrays in memory. Hence, modifying them requires copying the entire vector. Since we have O(N) such modifications, the whole algorithm is O(n^2), so we have to copy about 2 terabytes with N = 500000. So, there isn't much use for marking visited nodes inside the vector. Instead, build an IntSet of indices as needed.

initialStack (length acc) also looks really bad. It's almost never a good idea to use length on large lists, because it's also O(n). It's probably not as nearly as bad as // in your code, since it sits in a relatively rarely occurring branch, but it'd still leave the performance crippled after we've corrected the vector issue.

Also, the search implementation seems rather unclear and overcomplicated to me. Aiming for a literal-minded translation of the pseudocode on the Wiki page should be a good start. Also, it's unnecessary to store the indices in nodes, since they can be determined from vector positions and the adjacency lists.

answered Oct 29 '22 06:10

András Kovács

Related questions
                            
                                Nearest neighbor 1 dimensional data with a specified range
                            
                                distinguishing objects with opencv
                            
                                Algorithm: optimally distribute N people among M tasks given P preferences
                            
                                "quick select" (or similar) implementation on Linux? (instead of sort|uniq -c|sort -rn|head -$N)
                            
                                Fastest way to use big numbers
                            
                                Algorithm to find all possible splits of a list
                            
                                Iterate sum of two squares
                            
                                Genetic/Evolutionary algorithm - Painter
                            
                                Best way to find if path exists in a unidirectional directed graph
                            
                                How to create a Boggle Board from a list of Words? (reverse Boggle solver!)
                            
                                An efficient algorithm to calculate the integer square root (isqrt) of arbitrarily large integers
                            
                                A* graph algorithm giving incorrect output
                            
                                How to convert from any large arbitrary base to another
                            
                                Algorithm for the decomposition of polygons
                            
                                Segmented Least Squares
                            
                                Find all polygons in points using MATLAB
                            
                                How to certify a planar embedding?
                            
                                Split array into evenly-distributed chunks
                            
                                How to obtain an "indirect implicant" in propositional logic
                            
                                Calculating max from [currPos] to [currPos - k] in large array

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Optimising manipulation of large vectors

Tags:

algorithm

haskell

Simon H

People also ask

1 Answers

András Kovács

Recent Activity

Donate For Us