Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Converting a hierarchical data structure to a flat one in Haskell

Tags:

list

haskell

tree

I'm extracting some data from a text document organized like this:

- "day 1"
    - "Person 1"
        - "Bill 1"
    - "Person 2"
        - "Bill 2"

I can read this into a list of tuples that looks like this:

[(0,["day 1"]),(1,["Person 1"]),(2,["Bill 1"]),(1,["Person 2"]),(2,["Bill 2"])]

Where the first item of each tuple indicates the heading level, and the second item the information associated with each heading.

My question is, how can I get a list of items that looks like this:

[["day 1","Person 1","Bill 1"],["day 1","Person 2","Bill 2"]]

I.e. one list per deepest nested item, containing all the information from the headings above it. The closest I've gotten is this:

f [] = []
f (x:xs) = row:f rest where
leaves = takeWhile (\i -> fst i > fst x) xs
rest = dropWhile (\i -> fst i > fst x) xs
row = concat $ map (\i -> (snd x):[snd i]) leaves

Which gives me this:

[[["day 1"],["Intro 1"],["day 1"],["Bill 1"],["day 1"],["Intro 2"],["day 1"],["Bill 2"]]]

I'd like the solution to work for any number of levels.

P.s. I'm new to Haskell. I have a sense that I could/should use a tree to store the data, but I can't wrap my head around it. I also could not think of a better title.

like image 231
ajerneck Avatar asked Feb 19 '23 18:02

ajerneck


1 Answers

Trees

You were right that you should probably use a tree to store the data. I'll copy how Data.Tree does it:

data Tree a = Node a (Forest a) deriving (Show)

type Forest a = [Tree a]

Building the Tree

Now we want to take your weakly typed list of tuples and convert it to a (slightly) stronger Tree of Strings. Any time you need to convert a weakly typed value and validate it before converting to a stronger type, you use a Parser:

type YourData = [(Int, [String])]

type Parser a = YourData -> Maybe (a, YourData)

The YourData type synonym represents the weak type that you are parsing. The a type variable is the value you are retrieving from the parse. Our Parser type returns a Maybe because the Parser might fail. To see why, the following input does not correspond to a valid Tree, since it is missing level 1 of the tree:

[(0, ["val1"]), (2, ["val2"])]

If the Parser does succeed, it also returns the unconsumed input so that subsequent parsing stages can use it.

Now, curiously enough, the above Parser type exactly matches a well known monad transformer stack:

StateT s Maybe a

You can see this if you expand out the underlying implementation of StateT:

StateT s Maybe a ~ s -> Maybe (a, s)

This means we can just define:

import Control.Monad.Trans.State.Strict

type Parser a = StateT [(Int, [String])] Maybe a

If we do this, we get a Monad, Applicative and Alternative instance for our Parser type for free. This makes it very easy to define parsers!

First, we must define a primitive parser that consumes a single node of the tree:

parseElement :: Int -> Parser String
parseElement level = StateT $ \list -> case list of
    []                  -> Nothing
    (level', strs):rest -> case strs of
        [str] ->
            if (level' == level)
            then Just (str, rest)
            else Nothing
        _     -> Nothing

This is the only non-trivial piece of code we have to write, which, because it is total, handles all the following corner cases:

  • The list is empty
  • Your node has multiple values in it
  • The number in the tuple doesn't match the expected depth

The next part is where things get really elegant. We can then define two mutually recursive parsers, one for parsing a Tree, and the other for parsing a Forest:

import Control.Applicative

parseTree :: Int -> Parser (Tree String)
parseTree level = Node <$> parseElement level <*> parseForest (level + 1)

parseForest :: Int -> Parser (Forest String)
parseForest level = many (parseTree level)

The first parser uses Applicative style, since StateT gave us an Applicative instance for free. However, I could also have used StateT's Monad instance instead, to give code that's more readable for an imperative programmer:

parseTree :: Int -> Parser (Tree String)
parseTree level = do
    str    <- parseElement level
    forest <- parseForest (level + 1)
    return $ Node str forest

But what about the many function? What's that doing? Let's look at its type:

many :: (Alternative f) => f a -> f [a]

It takes anything that returns a value and implements Applicative and instead calls it repeatedly to return a list of values instead. When we defined our Parser type in terms of State, we got an Alternative instance for free, so we can use the many function to convert something that parses a single Tree (i.e. parseTree), into something that parses a Forest (i.e. parseForest).

To use our Parser, we just rename an existing StateT function to make its purpose clear:

runParser :: Parser a -> [(Int, [String])] -> Maybe a runParser = evalStateT

Then we just run it!

>>> runParser (parseForest 0) [(0,["day 1"]),(1,["Person 1"]),(2,["Bill 1"]),(1,["Person 2"]),(2,["Bill 2"])]
Just [Node "day 1" [Node "Person 1" [Node "Bill 1" []],Node "Person 2" [Node "Bill 2" []]]]

That's just magic! Let's see what happens if we give it an invalid input:

>>> runParser (parseForest 0) [(0, ["val1"]), (2, ["val2"])]
Just [Node "val1" []]

It succeeds on a portion of the input! We can actually specify that it must consume the entire input by defining a parser that matches the end of the input:

eof :: Parser ()
eof = StateT $ \list -> case list of
    [] -> Just ((), [])
    _  -> Nothing

Now let's try it:

>>> runParser (parseForest 0 >> eof) [(0, ["val1"]), (2, ["val2"])]
Nothing

Perfect!

Flattening the Tree

To answer your second question, we again solve the problem using mutually recursive functions:

flattenForest :: Forest a -> [[a]]
flattenForest forest = concatMap flattenTree forest

flattenTree :: Tree a -> [[a]]
flattenTree (Node a forest) = case forest of
    [] -> [[a]]
    _ -> map (a:) (flattenForest forest)

Let's try it!

>>> flattenForest [Node "day 1" [Node "Person 1" [Node "Bill 1" []],Node "Person 2" [Node "Bill 2" []]]]
[["day 1","Person 1","Bill 1"],["day 1","Person 2","Bill 2"]]

Now, technically I didn't have to use mutually recursive functions. I could have done a single recursive function. I was just following the definition of the Tree type from Data.Tree.

Conclusion

So in theory I could have shortened the code even further by skipping the intermediate Tree type and just parsing the flattened result directly, but I figured you might want to use the Tree-based representation for other purposes.

The key take home points from this are:

  • Learn Haskell abstractions to simplify your code
  • Always write total functions
  • Learn to use recursion effectively

If you do these, you will write robust and elegant code that exactly matches the problem.

Appendix

Here is the final code that incorporates everything I've said:

import Control.Applicative
import Control.Monad.Trans.State.Strict
import Data.Tree

type YourType = [(Int, [String])]

type Parser a = StateT [(Int, [String])] Maybe a

runParser :: Parser a -> [(Int, [String])] -> Maybe a
runParser = evalStateT

parseElement :: Int -> Parser String
parseElement level = StateT $ \list -> case list of
    []                  -> Nothing
    (level', strs):rest -> case strs of
        [str] ->
            if (level' == level)
            then Just (str, rest)
            else Nothing
        _     -> Nothing

parseTree :: Int -> Parser (Tree String)
parseTree level = Node <$> parseElement level <*> parseForest (level + 1)

parseForest :: Int -> Parser (Forest String)
parseForest level = many (parseTree level)

eof :: Parser ()
eof = StateT $ \list -> case list of
    [] -> Just ((), [])
    _  -> Nothing

flattenForest :: Forest a -> [[a]]
flattenForest forest = concatMap flattenTree forest

flattenTree :: Tree a -> [[a]]
flattenTree (Node a forest) = case forest of
    [] -> [[a]]
    _  -> map (a:) (flattenForest forest)
like image 116
Gabriella Gonzalez Avatar answered May 10 '23 20:05

Gabriella Gonzalez