I'm extracting some data from a text document organized like this:
- "day 1"
    - "Person 1"
        - "Bill 1"
    - "Person 2"
        - "Bill 2"
I can read this into a list of tuples that looks like this:
[(0,["day 1"]),(1,["Person 1"]),(2,["Bill 1"]),(1,["Person 2"]),(2,["Bill 2"])]
Where the first item of each tuple indicates the heading level, and the second item the information associated with each heading.
My question is, how can I get a list of items that looks like this:
[["day 1","Person 1","Bill 1"],["day 1","Person 2","Bill 2"]]
I.e. one list per deepest nested item, containing all the information from the headings above it. The closest I've gotten is this:
f [] = []
f (x:xs) = row:f rest where
leaves = takeWhile (\i -> fst i > fst x) xs
rest = dropWhile (\i -> fst i > fst x) xs
row = concat $ map (\i -> (snd x):[snd i]) leaves
Which gives me this:
[[["day 1"],["Intro 1"],["day 1"],["Bill 1"],["day 1"],["Intro 2"],["day 1"],["Bill 2"]]]
I'd like the solution to work for any number of levels.
P.s. I'm new to Haskell. I have a sense that I could/should use a tree to store the data, but I can't wrap my head around it. I also could not think of a better title.
You were right that you should probably use a tree to store the data.  I'll copy how Data.Tree does it:
data Tree a = Node a (Forest a) deriving (Show)
type Forest a = [Tree a]
Now we want to take your weakly typed list of tuples and convert it to a (slightly) stronger Tree of Strings.  Any time you need to convert a weakly typed value and validate it before converting to a stronger type, you use a Parser:
type YourData = [(Int, [String])]
type Parser a = YourData -> Maybe (a, YourData)
The YourData type synonym represents the weak type that you are parsing.  The a type variable is the value you are retrieving from the parse.  Our Parser type returns a Maybe because the Parser might fail.  To see why, the following input does not correspond to a valid Tree, since it is missing level 1 of the tree:
[(0, ["val1"]), (2, ["val2"])]
If the Parser does succeed, it also returns the unconsumed input so that subsequent parsing stages can use it.
Now, curiously enough, the above Parser type exactly matches a well known monad transformer stack:
StateT s Maybe a
You can see this if you expand out the underlying implementation of StateT:
StateT s Maybe a ~ s -> Maybe (a, s)
This means we can just define:
import Control.Monad.Trans.State.Strict
type Parser a = StateT [(Int, [String])] Maybe a
If we do this, we get a Monad, Applicative and Alternative instance for our Parser type for free.  This makes it very easy to define parsers!
First, we must define a primitive parser that consumes a single node of the tree:
parseElement :: Int -> Parser String
parseElement level = StateT $ \list -> case list of
    []                  -> Nothing
    (level', strs):rest -> case strs of
        [str] ->
            if (level' == level)
            then Just (str, rest)
            else Nothing
        _     -> Nothing
This is the only non-trivial piece of code we have to write, which, because it is total, handles all the following corner cases:
The next part is where things get really elegant.  We can then define two mutually recursive parsers, one for parsing a Tree, and the other for parsing a Forest:
import Control.Applicative
parseTree :: Int -> Parser (Tree String)
parseTree level = Node <$> parseElement level <*> parseForest (level + 1)
parseForest :: Int -> Parser (Forest String)
parseForest level = many (parseTree level)
The first parser uses Applicative style, since StateT gave us an Applicative instance for free.  However, I could also have used StateT's Monad instance instead, to give code that's more readable for an imperative programmer:
parseTree :: Int -> Parser (Tree String)
parseTree level = do
    str    <- parseElement level
    forest <- parseForest (level + 1)
    return $ Node str forest
But what about the many function?  What's that doing?  Let's look at its type:
many :: (Alternative f) => f a -> f [a]
It takes anything that returns a value and implements Applicative and instead calls it repeatedly to return a list of values instead.  When we defined our Parser type in terms of State, we got an Alternative instance for free, so we can use the many function to convert something that parses a single Tree (i.e. parseTree), into something that parses a Forest (i.e. parseForest).
To use our Parser, we just rename an existing StateT function to make its purpose clear:
runParser :: Parser a -> [(Int, [String])] -> Maybe a runParser = evalStateT
Then we just run it!
>>> runParser (parseForest 0) [(0,["day 1"]),(1,["Person 1"]),(2,["Bill 1"]),(1,["Person 2"]),(2,["Bill 2"])]
Just [Node "day 1" [Node "Person 1" [Node "Bill 1" []],Node "Person 2" [Node "Bill 2" []]]]
That's just magic! Let's see what happens if we give it an invalid input:
>>> runParser (parseForest 0) [(0, ["val1"]), (2, ["val2"])]
Just [Node "val1" []]
It succeeds on a portion of the input! We can actually specify that it must consume the entire input by defining a parser that matches the end of the input:
eof :: Parser ()
eof = StateT $ \list -> case list of
    [] -> Just ((), [])
    _  -> Nothing
Now let's try it:
>>> runParser (parseForest 0 >> eof) [(0, ["val1"]), (2, ["val2"])]
Nothing
Perfect!
To answer your second question, we again solve the problem using mutually recursive functions:
flattenForest :: Forest a -> [[a]]
flattenForest forest = concatMap flattenTree forest
flattenTree :: Tree a -> [[a]]
flattenTree (Node a forest) = case forest of
    [] -> [[a]]
    _ -> map (a:) (flattenForest forest)
Let's try it!
>>> flattenForest [Node "day 1" [Node "Person 1" [Node "Bill 1" []],Node "Person 2" [Node "Bill 2" []]]]
[["day 1","Person 1","Bill 1"],["day 1","Person 2","Bill 2"]]
Now, technically I didn't have to use mutually recursive functions.  I could have done a single recursive function.  I was just following the definition of the Tree type from Data.Tree.
So in theory I could have shortened the code even further by skipping the intermediate Tree type and just parsing the flattened result directly, but I figured you might want to use the Tree-based representation for other purposes.
The key take home points from this are:
If you do these, you will write robust and elegant code that exactly matches the problem.
Here is the final code that incorporates everything I've said:
import Control.Applicative
import Control.Monad.Trans.State.Strict
import Data.Tree
type YourType = [(Int, [String])]
type Parser a = StateT [(Int, [String])] Maybe a
runParser :: Parser a -> [(Int, [String])] -> Maybe a
runParser = evalStateT
parseElement :: Int -> Parser String
parseElement level = StateT $ \list -> case list of
    []                  -> Nothing
    (level', strs):rest -> case strs of
        [str] ->
            if (level' == level)
            then Just (str, rest)
            else Nothing
        _     -> Nothing
parseTree :: Int -> Parser (Tree String)
parseTree level = Node <$> parseElement level <*> parseForest (level + 1)
parseForest :: Int -> Parser (Forest String)
parseForest level = many (parseTree level)
eof :: Parser ()
eof = StateT $ \list -> case list of
    [] -> Just ((), [])
    _  -> Nothing
flattenForest :: Forest a -> [[a]]
flattenForest forest = concatMap flattenTree forest
flattenTree :: Tree a -> [[a]]
flattenTree (Node a forest) = case forest of
    [] -> [[a]]
    _  -> map (a:) (flattenForest forest)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With