Monadic parsing functional pearl - gluing multiple parsers together

Question

I am working my way through the functional pearl paper Monadic parsing in Haskell (after recommendation at haskellforall.com to read that paper to understand parsing). I wrote an implementation until section 4 on page 3 as below:

newtype Parser a = Parser (String -> [(a,String)])

parse (Parser p) = p

instance Monad Parser where
  return a = Parser (\cs -> [(a,cs)])
  p >>= f  = Parser (\cs -> concat [parse (f a) cs' | (a,cs') <- parse p cs])

item :: Parser Char
item = Parser (\cs -> case cs of
                      ""     -> []
                      (c:cs) -> [(c,cs)])

p :: Parser (Char,Char)
p = do { a <- item; item; b <- item; return (a,b)}

According to the paper, p is a parser that consumes three characters, skips middle one, and returns a pair of first and second. What I can't figure out is how the modified input string is passed to 2nd and 3rd definitions of item in p. We are not passing the result of first parser to second parser, and so on (because ;, syntactic sugar for >> is used which discards the result as shown by type signature (>>) :: Monad m => m a -> m b -> m b). I will appreciate explanation of how the modified function is being passed in last two invocations of item in p.

Another thing that confuses me is the handling of cs in item - it doesn't return (head,tail) pair. Shouldn't it be redefined as follow since the item parser consumes one character according to the paper:

item :: Parser Char
item = Parser (\cs -> case cs of
                      ""     -> []
                      (c:cs') -> [(c,cs')]) -- redefinition - use cs' to denote tail

chi · Accepted Answer

The syntax ; is not always syntactic sugar for >>.

Rather, we have:

do    m ; n  =  m >> n
do x<-m ; n  =  m >>= \x -> n

(The above translation is simplified, the full gory details can be found in the Haskell Report)

So, your definition for p is equivalent to:

p = item >>= \a -> ( item >> (item >>= \b -> return (a,b) ))

Here, you can see that the first and third items do not have their results discarded (because >>= binds them to a and b respectively), while the middle item does.

Also note that the code

\cs -> case cs of
        ""     -> []
       (c:cs) -> [(c,cs)]

is misleading since it is defining variable cs twice: once in the \cs and once in the pattern (c:cs). It is equivalent to

\cs -> case cs of
        ""     -> []
       (x:xs) -> [(x,xs)]

This clarifies that the final String is the output is not the original cs one, but rather its tail xs.

In a comment, the poster wondered why the three uses of item do not return the same result, i.e., why in return (a,b) the character a is not equal to b. This is due to the >>= monadic operator, which in this Parser monad automatically feeds the output string xs of each item occurence to the next one. Indeed, the whole point of this monad is to help feeding the "leftover" output of each parser as the "to-be-consumed" input in the next one. This has two advantages: it frees the programmer from having to write code to pass this string around, and it ensures that the string is not accidentally "rewound" to a previous state. To illustrate the latter point, here's some wrong code:

let [(c1,s1)] = someParser someInitialString
    [(c2,s2)] = anotherParser1 s1
    [(c3,s3)] = anotherParser2 s2
    [(c4,s4)] = anotherParser3 s3
    [(c5,s5)] = anotherParser4 s2  -- Whoops! Should have been s4
    in [c1,c2,c3,c4,c5]

In the last step the string, after having been consumed multiple times, is wrongly rolled back to a previous state, as if the parsers anotherParser2 and anotherParser3 did not consume anything at all. This error is prevented by composing parsers through >>= instead.

rafalio · Answer

I'll try shedding some more light regarding >>. As you see in the other answer, you should desugar the do's into >>= to better understand what's going on.

Let's for example write a parser that parses two chars and returns them.

twoChars :: Parser (Char,Char)
twoChars = do
  i <- item
  j <- item
  return (i,j)

Now, desugar the do syntax:

twoChars :: Parser (Char,Char)
twoChars =
  item >>= (\i ->
  item >>= (\j ->
  return (i,j) ) )

I put brackets for clarity. As you see, the second item receives the result of the first item parser in the anonymous function, with the result bound to i. The >>= function takes a parser, a function, and returns a parser. Best way to understand it would be to plug it into the definition:

f = \i → item »= \j → return (i,j)
twoChars = item >>= f
twoChars = Parser (\cs -> concat [parse (f a) cs' | (a,cs') <- parse item cs])

So we got back a new Parser. Try to imagine what it will do on an input "abc". cs is bound to "abc", and the item Parser is used to get back [('a',"bc")]. Now, we apply f to 'a', to get back the new parser:

item >>= \j -> return ('a',j)

This parser will be passed the rest of our string left to process ("bc"), and it will use the item parser to get out the b when the \j above is bound to b. We then get a return ('a','b') statement, which puts ('a','b') into a parser that just return ('a','b').

I hope this clears up how the information flow happens. Now, suppose that you want to ignore a character. You could do it like this.

twoChars :: Parser (Char,Char)
twoChars =
  item >>= \i ->
  item >>= \j ->
  item >>= \k ->
  return (i,k)

It's ok that the j is bound to 'b' for the example "abc", you never use it. We can so replace j by _.

twoChars :: Parser (Char,Char)
twoChars =
  item >>= \i ->
  item >>= \_ ->
  item >>= \k ->
  return (i,k)

But we also know that >> :: m a -> m b -> m b can be defined as:

p >> q = p >>= \_ -> q

So we are left with

twoChars :: Parser (Char,Char)
twoChars =
  item >>= \i ->
  item >>
  item >>= \k ->
  return (i,k)

Finally, you can sugar this back into do. The application of >> simply sugars into a single-line statement with no bounding. It results in:

twoChars :: Parser (Char,Char)
twoChars = do
  i <- item
  item
  j <- item
  return (i,j)

Hope this cleared some things up.

Will Ness · Answer

The more uniform translation of your

p3 = do { a <- item; item; b <- item; return (a,b)}
  -- do { a <- item; z <- item; b <- item; return (a,b)}  -- z is ignored

is

p3 = item >>= (\a ->
       item >>= (\z ->     
         item >>= (\b ->
           return (a,b))))     -- z is unused

(the key observation here is that the functions are nested). Which means that

-- parse (return a) cs = [(a,cs)]
-- parse (p >>= f)  cs = [r | (a,cs1) <- parse p cs,         -- concat
--                            r       <- parse (f a) cs1] )  --   inlined !
parse p3 cs 
  = [ r | (a,cs1) <- parse item cs,
          r <- [ r | (z,cs2) <- parse item cs1,
                     r <- [ r | (b,cs3) <- parse item cs2,
                                r <-    -- parse (return (a,b)) cs3
                                     [((a,b),cs3)]]]]        -- z is unused

  = [ ((a,b),cs3) | (a,cs1) <- parse item cs,
                    (_,cs2) <- parse item cs1, 
                    (b,cs3) <- parse item cs2]

So you see, "the input string" does change: first it's cs, then cs1, then cs2.

That is the simple real computation behind all the Parser tags and do syntax. It's all just about the chaining of inputs and outputs in the nested loops, in the end:

parse p3 cs =
    for each (a,cs1) in (parse item cs):
       for each (z,cs2) in (parse item cs1):
          for each (b,cs3) in (parse item cs2):
             yield ((a,b),cs3)

Monadic parsing functional pearl - gluing multiple parsers together

Tags:

parsing

haskell

Sal

3 Answers

chi

rafalio

Will Ness

Recent Activity

Donate For Us

Monadic parsing functional pearl - gluing multiple parsers together

Tags:

parsing

haskell

Sal

3 Answers

chi

rafalio

Will Ness

Related questions

Recent Activity

Donate For Us