Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimising manipulation of large vectors

This is a follow up to my previous question about processing a Vector representation of a 5.1m edge directed graph. I am trying to implement Kosaraju's graph algorithm and thus need to rearrange my Vector in the order of the finishing times of a depth first search (DFS) on the edges reversed. I have code that runs on small data sets but that fails to return in 10 minutes on the full data set. (I can't exclude that a loop arises from the big graph, but there are no signs of that on my test data.)

DFS needs to avoid revisiting nodes, so I need some sort of 'state' for the search (currently a tuple, should I use a State Monad?). The first search should return a reordered Vector, but I am keeping things simple at present by returning a list of the reordered Node indexes so that I can process the Vector in one go subsequently.

I presume the issue lies in dfsInner. The code below 'remembers' the nodes visited updating the explored field of each node (third guard). Although I tried to make it tail recursive, the code seems to grow memory use fairly fast. Do I need to enforce some strictness and if so, how? (I have another version that I use on a single search search, which checks for previous visits by looking at the start nodes of the unexplored edges on the stack and the list of nodes that have been completed. This does not grow so quickly, but does not return for any well connected node.)

However, it could also be the foldr', but how can I detect that?

This is supposedly Coursera homework, but I'm no longer sure I can tick the honour code button! Learning is more important though, so I don't really want a copy/paste answer. What I have is not very elegant - it has an imperative feel to it too, which is driven by the issue with keeping some sort of state - see third guard. I'd welcome comments on design patterns.

type NodeName = Int
type Edges    = [NodeName]
type Explored = Bool
type Stack    = [(Int, Int)]

data Node  = Node NodeName Explored Edges Edges deriving (Eq, Show)
type Graph = Vector Node

main = do
    edges <- V.fromList `fmap` getEdges "SCC.txt"
    let 
        maxIndex = fst $ V.last edges
        gr = createGraph maxIndex edges
        res = dfsOuter gr
    --return gr
    putStrLn $ show res

dfsOuter gr = 
    let tmp = V.foldr' callInner (gr,[]) gr
    in snd tmp

callInner :: Node -> (Graph, Stack) -> (Graph, Stack)
callInner (Node idx _ fwd bwd) (gr,acc) = 
    let (Node _ explored _ _) = gr V.! idx
    in case explored of
        True  -> (gr, acc)
        False ->
            let
                initialStack = map (\l -> (idx, l)) bwd
                gr' = gr V.// [(idx, Node idx True fwd bwd)]
                (gr'', newScc) = dfsInner idx initialStack (length acc) (gr', [])
            in (gr'', newScc++acc)

dfsInner :: NodeName -> Stack -> Int -> (Graph, [(Int, Int)]) -> (Graph, [(Int, Int)])
dfsInner start [] finishCounter (gr, acc) = (gr, (start, finishCounter):acc)
dfsInner start stack finishCounter (gr, acc)
    | nextStart /= start =                      -- no more places to go from this node
        dfsInner nextStart stack (finishCounter + 1) $ (gr, (start, finishCounter):acc)
    | nextExplored = 
-- nextExplored || any (\(y,_) -> y == stack0Head) stack || any (\(x,_) -> x == stack0Head) acc =
        dfsInner start (tail stack) finishCounter (gr, acc)
    | otherwise =
        dfsInner nextEnd (add2Stack++stack) finishCounter (gr V.// [(nextEnd, Node idx True nextLHS nextRHS)], acc)
--      dfsInner gr stack0Head (add2Stack++stack) finishCounter acc

    where
        (nextStart, nextEnd) = head stack
        (Node idx nextExplored nextLHS nextRHS) = gr V.! nextEnd
        add2Stack = map (\l -> (nextEnd, l)) nextRHS
like image 347
Simon H Avatar asked Jun 21 '14 18:06

Simon H


People also ask

Is array faster than vector C++?

A std::vector can never be faster than an array, as it has (a pointer to the first element of) an array as one of its data members. But the difference in run-time speed is slim and absent in any non-trivial program. One reason for this myth to persist, are examples that compare raw arrays with mis-used std::vectors.

How fast is vector Push_back?

#3 When filling up or copying into a vector, prefer assignment over insert() or push_back(). So we can see that vector assignment is 55.38% faster than Insert() and 89% faster than push_back().


1 Answers

In a nutshell:

Know the time complexities.

There are a lot of fine points to optimization, a large subset of which being not very important in everyday programming, but fail to know the asymptotic complexities and programs will often just not work at all.

Haskell libraries usually document the complexities, especially when it's not obvious or not effective (linear of worse). In particular, all the complexities relevant to this question can be found in Data.List and Data.Vector.

The performance is killed by V.// here. Vectors are boxed or unboxed immutable contiguous arrays in memory. Hence, modifying them requires copying the entire vector. Since we have O(N) such modifications, the whole algorithm is O(n^2), so we have to copy about 2 terabytes with N = 500000. So, there isn't much use for marking visited nodes inside the vector. Instead, build an IntSet of indices as needed.

initialStack (length acc) also looks really bad. It's almost never a good idea to use length on large lists, because it's also O(n). It's probably not as nearly as bad as // in your code, since it sits in a relatively rarely occurring branch, but it'd still leave the performance crippled after we've corrected the vector issue.

Also, the search implementation seems rather unclear and overcomplicated to me. Aiming for a literal-minded translation of the pseudocode on the Wiki page should be a good start. Also, it's unnecessary to store the indices in nodes, since they can be determined from vector positions and the adjacency lists.

like image 78
András Kovács Avatar answered Oct 29 '22 06:10

András Kovács