Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reconstruct a graph from BFS output in Haskell

I want to reconstruct the incidence structure of a graph in Haskell, which is given by the output of a breadth first traversal of it. Explicitly, the output consists of a root vertex and a list of neighborhoods (a neighborhood is a list of vertices marked as new or old (= already visited)), where each neighborhood corresponds to the least vertex which has not been assigned to a neighborhood, yet.

In any imperative language, I would solve the problem by using a queue:

Input: root vertex r, list of neighborhoods L
(1) Put r into the empty queue Q
(2) if Q is empty then STOP
(3) extract the first vertex v of Q
(4) extract the first neighborhood N of L
(5) append the unvisited vertices of N to Q
(6) remove the markings (new/old) of the nodes of N and assign v to N
(7) goto (2)

I tried to implement this naive algorithm in Haskell (by using a list or by using Data.Sequence as queue), but ghci always runs out of memory. This should not happen, because although the input consists of 300MB data, 16GB RAM should clearly suffice.

Therefore the naive implementation seems to cause a memory leak. How would you implement this algorithm in Haskell?

Edit: Here are the (slightly simplified) data types, I use:

data Output = Out !Vertex ![[BFSNode]]
data Vertex = Vertex Integer SomeMoreComplexData
data BFSNode = New Vertex | Old Integer

data Graph = ![Vertex] ![(Integer,[Integer])]

The data type "Output" contains the already parsed BFS output consisting of the root vertex and the lists of neighborhoods. BFSNode corresponds to a node in the BFS tree which belongs to either a new vertex which is visited for the first time, or to an old vertex which already has been visited and which is therefore referred by its unique number. Note that the parsing process works fine and consumes very few memory.

My aim is to convert "Output" into the data type "Graph" which consists of the lists of vertices and of an incidence list.

Here is a simplified version of my implementation:

readTree :: [[BFSNode]] -> Seq Integer -> Graph
readTree [] _ = Graph [] []
readTree (nb:nbs) qs =
    let (i :< qs') = viewl qs
        newVs = fromList $! map nodeNr . filter isNew $ nb
        (Graph vs adj) = readTree nbs $ qs' >< newVs
    in  Graph (map unNew (filter isNew nb) ++ vs) ((i,nub $ map nodeNr nb):adj)

"nbs" is the list of neighborhoods, "qs" is the queue. The function "nodeNr" extracts the unique identification number from a vertex, "isNew" tests whether a vertex is new, and "unNew" unpacks a new vertex from the data type "BFSNode".

Edit2: I think I localized the problem now. Maybe it has nothing to do with my implementation of the conversion process. My failure was to use the build in function "read" to read the data type "Output" from a file. I realized now that Haskell has problems with reading big files. Even if it were just about reading a list of integers, e.g.

main = do 
    txt <- readFile "test"
    writeFile "test2" . show $ (read txt :: [Integer]) }

the program will run out of memory if the file "test" is big enough. I understand now, that it is no good idea to parse data in this way, since "read" will load all data into the memory before showing any output, but I still do not understand why it fills 16GB of RAM although the file amounts not even 500MB. Do you have any idea what is wrong with "read"? Does Haskell show the same behavior on your machines?

Edit3: Now I implemented a stream based parsing function "readOutput" which takes a String and returns the data type "Output". This function is lazy, so I immediately get an output when I call it. But when I compose it with my conversion function "readTree" (which is clearly tail-recursive) I get no output at all and the memory usage increases as usual. What am I doing wrong?

Edit4: The problem in Edit3 came from some strictifications which I removed now.

like image 701
Dune Avatar asked Dec 04 '13 15:12

Dune


1 Answers

This question does not specify a key ingredient - how is the graph going to be represented in Haskell? Functional programs require carefully thought out data structures to maximize sharing and run efficiently. Usually, this means they're recursively built from nothing (inductive). There's a paper on inductive graphs and functional graph algorithms‎ that gives one representation:

module Test where

data Graph a = Empty | Extension (Graph a) [Int] (Int, a)
               deriving Show

That is, a graph is either Empty, or a (smaller) graph extended by one node. This is exactly how lists are built using Cons in functional languages, except that the additional node has to specify the smaller graph, the predecessor links ([Int]), and the new node number and data, (Int,a). Note that they also implemented this as an abstract type ''for efficiency reasons.''

A graph with one node can be generated by extending the empty graph.

singleton :: (Int,a) -> Graph a
singleton x = Extension Empty [] x

Using this structure, it's simple to define a recursive parse algorithm for your BFS tree.

data Mark a = Visited Int | New (Int,a) deriving Show

parse :: (Int,a) -> [[Mark a]] -> Graph a
parse x nbrs = extend Empty [x] nbrs

extend :: Graph a -> [(Int,a)] -> [[Mark a]] -> Graph a
extend g [] [] = g
extend g _  [] = Empty -- leftover nodes, really an error.
extend g [] _  = Empty -- leftover neighborhoods, really an error.
extend g (x : tl) (nbr : nbrs) =
  extend (Extension g (seen nbr) x) (news tl nbr) nbrs

news :: [(Int,a)] -> [Mark a] -> [(Int,a)]
news l (New x : tl) = news (uniq l x) tl
news l (_ : tl) = news l tl
news l [] = l

uniq :: [(Int,a)] -> (Int,a) -> [(Int,a)]
uniq (x:tl) y = x : if (fst x == fst y) then tl else uniq tl y
uniq [] y = [y]

seen :: [Mark a] -> [Int]
seen (Visited i : tl) = i : seen tl
seen (_ : tl) = seen tl
seen [] = []

m0 = [New (1,())]
m1 = [Visited 0, New (2,()), New (3,())]
m2 = [Visited 1, New (3,())]
m3 = [Visited 1, Visited 2]    
nbrs = [m0,m1,m2,m3]

Testing it out,

$ ghci
GHCi, version 7.6.3: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Prelude> :load Test
[1 of 1] Compiling Test             ( Test.hs, interpreted )
Ok, modules loaded: Test.
*Test> parse (0,()) nbrs
Extension (Extension (Extension (Extension Empty [] (0,())) [0] (1,())) [1] (2,())) [1,2] (3,())

For efficiency, you could do the following:

  1. The news and seen functions could be combined let (ns,sn) = newseen nbr ([],[]) and made tail-recursive (passing their partially constructed lists and returning immediately) for efficiency.

  2. Your input could keep track of the node at the center of each neighbor list. This would avoid the list concatenation in the stack of neighbors. Alternatively, you could use a functional dequeue to hold that stack.

If you haven't seen it, I'd recommend Okasaki's book on purely functional data structures.

like image 109
David M. Rogers Avatar answered Nov 06 '22 16:11

David M. Rogers