I'm extracting some data from a text document organized like this:
- "day 1"
- "Person 1"
- "Bill 1"
- "Person 2"
- "Bill 2"
I can read this into a list of tuples that looks like this:
[(0,["day 1"]),(1,["Person 1"]),(2,["Bill 1"]),(1,["Person 2"]),(2,["Bill 2"])]
Where the first item of each tuple indicates the heading level, and the second item the information associated with each heading.
My question is, how can I get a list of items that looks like this:
[["day 1","Person 1","Bill 1"],["day 1","Person 2","Bill 2"]]
I.e. one list per deepest nested item, containing all the information from the headings above it. The closest I've gotten is this:
f [] = []
f (x:xs) = row:f rest where
leaves = takeWhile (\i -> fst i > fst x) xs
rest = dropWhile (\i -> fst i > fst x) xs
row = concat $ map (\i -> (snd x):[snd i]) leaves
Which gives me this:
[[["day 1"],["Intro 1"],["day 1"],["Bill 1"],["day 1"],["Intro 2"],["day 1"],["Bill 2"]]]
I'd like the solution to work for any number of levels.
P.s. I'm new to Haskell. I have a sense that I could/should use a tree to store the data, but I can't wrap my head around it. I also could not think of a better title.
You were right that you should probably use a tree to store the data. I'll copy how Data.Tree
does it:
data Tree a = Node a (Forest a) deriving (Show)
type Forest a = [Tree a]
Now we want to take your weakly typed list of tuples and convert it to a (slightly) stronger Tree
of String
s. Any time you need to convert a weakly typed value and validate it before converting to a stronger type, you use a Parser
:
type YourData = [(Int, [String])]
type Parser a = YourData -> Maybe (a, YourData)
The YourData
type synonym represents the weak type that you are parsing. The a
type variable is the value you are retrieving from the parse. Our Parser
type returns a Maybe
because the Parser
might fail. To see why, the following input does not correspond to a valid Tree
, since it is missing level 1 of the tree:
[(0, ["val1"]), (2, ["val2"])]
If the Parser
does succeed, it also returns the unconsumed input so that subsequent parsing stages can use it.
Now, curiously enough, the above Parser
type exactly matches a well known monad transformer stack:
StateT s Maybe a
You can see this if you expand out the underlying implementation of StateT
:
StateT s Maybe a ~ s -> Maybe (a, s)
This means we can just define:
import Control.Monad.Trans.State.Strict
type Parser a = StateT [(Int, [String])] Maybe a
If we do this, we get a Monad
, Applicative
and Alternative
instance for our Parser
type for free. This makes it very easy to define parsers!
First, we must define a primitive parser that consumes a single node of the tree:
parseElement :: Int -> Parser String
parseElement level = StateT $ \list -> case list of
[] -> Nothing
(level', strs):rest -> case strs of
[str] ->
if (level' == level)
then Just (str, rest)
else Nothing
_ -> Nothing
This is the only non-trivial piece of code we have to write, which, because it is total, handles all the following corner cases:
The next part is where things get really elegant. We can then define two mutually recursive parsers, one for parsing a Tree
, and the other for parsing a Forest
:
import Control.Applicative
parseTree :: Int -> Parser (Tree String)
parseTree level = Node <$> parseElement level <*> parseForest (level + 1)
parseForest :: Int -> Parser (Forest String)
parseForest level = many (parseTree level)
The first parser uses Applicative
style, since StateT
gave us an Applicative
instance for free. However, I could also have used StateT
's Monad
instance instead, to give code that's more readable for an imperative programmer:
parseTree :: Int -> Parser (Tree String)
parseTree level = do
str <- parseElement level
forest <- parseForest (level + 1)
return $ Node str forest
But what about the many
function? What's that doing? Let's look at its type:
many :: (Alternative f) => f a -> f [a]
It takes anything that returns a value and implements Applicative
and instead calls it repeatedly to return a list of values instead. When we defined our Parser
type in terms of State
, we got an Alternative
instance for free, so we can use the many
function to convert something that parses a single Tree
(i.e. parseTree
), into something that parses a Forest
(i.e. parseForest
).
To use our Parser
, we just rename an existing StateT
function to make its purpose clear:
runParser :: Parser a -> [(Int, [String])] -> Maybe a runParser = evalStateT
Then we just run it!
>>> runParser (parseForest 0) [(0,["day 1"]),(1,["Person 1"]),(2,["Bill 1"]),(1,["Person 2"]),(2,["Bill 2"])]
Just [Node "day 1" [Node "Person 1" [Node "Bill 1" []],Node "Person 2" [Node "Bill 2" []]]]
That's just magic! Let's see what happens if we give it an invalid input:
>>> runParser (parseForest 0) [(0, ["val1"]), (2, ["val2"])]
Just [Node "val1" []]
It succeeds on a portion of the input! We can actually specify that it must consume the entire input by defining a parser that matches the end of the input:
eof :: Parser ()
eof = StateT $ \list -> case list of
[] -> Just ((), [])
_ -> Nothing
Now let's try it:
>>> runParser (parseForest 0 >> eof) [(0, ["val1"]), (2, ["val2"])]
Nothing
Perfect!
To answer your second question, we again solve the problem using mutually recursive functions:
flattenForest :: Forest a -> [[a]]
flattenForest forest = concatMap flattenTree forest
flattenTree :: Tree a -> [[a]]
flattenTree (Node a forest) = case forest of
[] -> [[a]]
_ -> map (a:) (flattenForest forest)
Let's try it!
>>> flattenForest [Node "day 1" [Node "Person 1" [Node "Bill 1" []],Node "Person 2" [Node "Bill 2" []]]]
[["day 1","Person 1","Bill 1"],["day 1","Person 2","Bill 2"]]
Now, technically I didn't have to use mutually recursive functions. I could have done a single recursive function. I was just following the definition of the Tree
type from Data.Tree
.
So in theory I could have shortened the code even further by skipping the intermediate Tree
type and just parsing the flattened result directly, but I figured you might want to use the Tree
-based representation for other purposes.
The key take home points from this are:
If you do these, you will write robust and elegant code that exactly matches the problem.
Here is the final code that incorporates everything I've said:
import Control.Applicative
import Control.Monad.Trans.State.Strict
import Data.Tree
type YourType = [(Int, [String])]
type Parser a = StateT [(Int, [String])] Maybe a
runParser :: Parser a -> [(Int, [String])] -> Maybe a
runParser = evalStateT
parseElement :: Int -> Parser String
parseElement level = StateT $ \list -> case list of
[] -> Nothing
(level', strs):rest -> case strs of
[str] ->
if (level' == level)
then Just (str, rest)
else Nothing
_ -> Nothing
parseTree :: Int -> Parser (Tree String)
parseTree level = Node <$> parseElement level <*> parseForest (level + 1)
parseForest :: Int -> Parser (Forest String)
parseForest level = many (parseTree level)
eof :: Parser ()
eof = StateT $ \list -> case list of
[] -> Just ((), [])
_ -> Nothing
flattenForest :: Forest a -> [[a]]
flattenForest forest = concatMap flattenTree forest
flattenTree :: Tree a -> [[a]]
flattenTree (Node a forest) = case forest of
[] -> [[a]]
_ -> map (a:) (flattenForest forest)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With