I am working my way through the functional pearl paper Monadic parsing in Haskell
(after recommendation at haskellforall.com
to read that paper to understand parsing). I wrote an implementation until section 4 on page 3 as below:
newtype Parser a = Parser (String -> [(a,String)])
parse (Parser p) = p
instance Monad Parser where
return a = Parser (\cs -> [(a,cs)])
p >>= f = Parser (\cs -> concat [parse (f a) cs' | (a,cs') <- parse p cs])
item :: Parser Char
item = Parser (\cs -> case cs of
"" -> []
(c:cs) -> [(c,cs)])
p :: Parser (Char,Char)
p = do { a <- item; item; b <- item; return (a,b)}
According to the paper, p
is a parser that consumes three characters, skips middle one, and returns a pair of first and second. What I can't figure out is how the modified input string is passed to 2nd and 3rd definitions of item
in p
. We are not passing the result of first parser to second parser, and so on (because ;
, syntactic sugar for >>
is used which discards the result as shown by type signature (>>) :: Monad m => m a -> m b -> m b
). I will appreciate explanation of how the modified function is being passed in last two invocations of item
in p
.
Another thing that confuses me is the handling of cs
in item
- it doesn't return (head,tail) pair. Shouldn't it be redefined as follow since the item
parser consumes one character according to the paper:
item :: Parser Char
item = Parser (\cs -> case cs of
"" -> []
(c:cs') -> [(c,cs')]) -- redefinition - use cs' to denote tail
The syntax ;
is not always syntactic sugar for >>
.
Rather, we have:
do m ; n = m >> n
do x<-m ; n = m >>= \x -> n
(The above translation is simplified, the full gory details can be found in the Haskell Report)
So, your definition for p
is equivalent to:
p = item >>= \a -> ( item >> (item >>= \b -> return (a,b) ))
Here, you can see that the first and third item
s do not have their results discarded (because >>=
binds them to a
and b
respectively), while the middle item
does.
Also note that the code
\cs -> case cs of
"" -> []
(c:cs) -> [(c,cs)]
is misleading since it is defining variable cs
twice: once in the \cs
and once in the
pattern (c:cs)
. It is equivalent to
\cs -> case cs of
"" -> []
(x:xs) -> [(x,xs)]
This clarifies that the final String
is the output is not the original cs
one, but rather its tail xs
.
In a comment, the poster wondered why the three uses of item
do not return the same result, i.e., why in return (a,b)
the character a
is not equal to b
. This is due to the >>=
monadic operator, which in this Parser
monad automatically feeds the output string xs
of each item
occurence to the next one. Indeed, the whole point of this monad is to help feeding the "leftover" output of each parser as the "to-be-consumed" input in the next one. This has two advantages: it frees the programmer from having to write code to pass this string around, and it ensures that the string is not accidentally "rewound" to a previous state. To illustrate the latter point, here's some wrong code:
let [(c1,s1)] = someParser someInitialString
[(c2,s2)] = anotherParser1 s1
[(c3,s3)] = anotherParser2 s2
[(c4,s4)] = anotherParser3 s3
[(c5,s5)] = anotherParser4 s2 -- Whoops! Should have been s4
in [c1,c2,c3,c4,c5]
In the last step the string, after having been consumed multiple times, is wrongly rolled back to a previous state, as if the parsers anotherParser2
and anotherParser3
did not consume anything at all. This error is prevented by composing parsers through >>=
instead.
I'll try shedding some more light regarding >>
.
As you see in the other answer, you should desugar the do's into >>=
to better understand what's going on.
Let's for example write a parser that parses two chars and returns them.
twoChars :: Parser (Char,Char)
twoChars = do
i <- item
j <- item
return (i,j)
Now, desugar the do
syntax:
twoChars :: Parser (Char,Char)
twoChars =
item >>= (\i ->
item >>= (\j ->
return (i,j) ) )
I put brackets for clarity. As you see, the second item
receives the result of the first item
parser in the anonymous function, with the result bound to i
. The >>=
function takes a parser, a function, and returns a parser. Best way to understand it would be to plug it into the definition:
f = \i → item »= \j → return (i,j)
twoChars = item >>= f
twoChars = Parser (\cs -> concat [parse (f a) cs' | (a,cs') <- parse item cs])
So we got back a new Parser. Try to imagine what it will do on an input "abc". cs
is bound to "abc", and the item Parser is used to get back [('a',"bc")]. Now, we apply f
to 'a', to get back the new parser:
item >>= \j -> return ('a',j)
This parser will be passed the rest of our string left to process ("bc"
), and it will use the item
parser to get out the b
when the \j
above is bound to b
. We then get a return ('a','b')
statement, which puts ('a','b')
into a parser that just return ('a','b')
.
I hope this clears up how the information flow happens. Now, suppose that you want to ignore a character. You could do it like this.
twoChars :: Parser (Char,Char)
twoChars =
item >>= \i ->
item >>= \j ->
item >>= \k ->
return (i,k)
It's ok that the j
is bound to 'b'
for the example "abc", you never use it. We can so replace j
by _
.
twoChars :: Parser (Char,Char)
twoChars =
item >>= \i ->
item >>= \_ ->
item >>= \k ->
return (i,k)
But we also know that >> :: m a -> m b -> m b
can be defined as:
p >> q = p >>= \_ -> q
So we are left with
twoChars :: Parser (Char,Char)
twoChars =
item >>= \i ->
item >>
item >>= \k ->
return (i,k)
Finally, you can sugar this back into do
. The application of >>
simply sugars into a single-line statement with no bounding. It results in:
twoChars :: Parser (Char,Char)
twoChars = do
i <- item
item
j <- item
return (i,j)
Hope this cleared some things up.
The more uniform translation of your
p3 = do { a <- item; item; b <- item; return (a,b)}
-- do { a <- item; z <- item; b <- item; return (a,b)} -- z is ignored
is
p3 = item >>= (\a ->
item >>= (\z ->
item >>= (\b ->
return (a,b)))) -- z is unused
(the key observation here is that the functions are nested). Which means that
-- parse (return a) cs = [(a,cs)]
-- parse (p >>= f) cs = [r | (a,cs1) <- parse p cs, -- concat
-- r <- parse (f a) cs1] ) -- inlined !
parse p3 cs
= [ r | (a,cs1) <- parse item cs,
r <- [ r | (z,cs2) <- parse item cs1,
r <- [ r | (b,cs3) <- parse item cs2,
r <- -- parse (return (a,b)) cs3
[((a,b),cs3)]]]] -- z is unused
= [ ((a,b),cs3) | (a,cs1) <- parse item cs,
(_,cs2) <- parse item cs1,
(b,cs3) <- parse item cs2]
So you see, "the input string" does change: first it's cs
, then cs1
, then cs2
.
That is the simple real computation behind all the Parser
tags and do
syntax. It's all just about the chaining of inputs and outputs in the nested loops, in the end:
parse p3 cs =
for each (a,cs1) in (parse item cs):
for each (z,cs2) in (parse item cs1):
for each (b,cs3) in (parse item cs2):
yield ((a,b),cs3)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With