I am trying to parse a very simple language that consists of only decimal or binary numbers. For example, here are some valid inputs:
#b1
#d1
#b0101
#d1234
I am having a problem using Parsec's choice operator: <|>
. According to the tutorial: Write yourself a Scheme in 48 hours:
[The choice operator] tries the first parser, then if it fails, tries the second. If either succeeds, then it returns the value returned by that parser..
But in my experience, I see that the order of the parsers supplied matters. Here is my program:
import System.Environment
import Text.ParserCombinators.Parsec
main :: IO ()
main = do
(x:_) <- getArgs
putStrLn ( "Hello, " ++ readExp x)
bin :: Parser String
bin = do string "#b"
x <- many( oneOf "01")
return x
dec :: Parser String
dec = do string "#d"
x <- many( oneOf "0123456789")
return x
-- Why does order matter here?
parseExp = (bin <|> dec)
readExp :: String -> String
readExp input = case parse parseExp "test" input of
Left error -> "Error: " ++ show error
Right val -> "Found val" ++ show val
Here is how I am running the program:
$ cabal sandbox init
$ cabal install parsec
$ cabal exec ghc Main
$ ./Main "#d1"
Hello, Error: "test" (line 1, column 1):
unexpected "d"
expecting "#b"
$ ./Main "#b1"
Hello, Found val"1"
If I change the order of the parsers as follows:
parseExp = (dec <|> bin)
then only binary numbers are detected and the program fails to identify the decimal numbers.
With the tests that I have performed, I see this problem only happens when one of the parsers has started parsing an input e.g. if a hash character #
is found, the bin parser is activated ending up in failing as the next character expected is b
and not d
. It seems like there should be some kind of backtracking that should happen, which I am not aware of.
Appreciate the help!
Parsec has two kinds of "failure": there are failures that consume input, and failures that don't. To avoid backtracking (and therefore holding onto inputs longer than necessary/being generally unfriendly to the garbage collector), (<|>)
commits to the first parser as soon as it consumes input; so that if its first argument consumes input and fails, its second parser never gets a chance to succeed. You can explicitly request backtracking behavior with try
, thus:
Text.Parsec> parse (string "ab" <|> string "ac") "" "ac"
Left (line 1, column 1):
unexpected "c"
expecting "ab"
Text.Parsec> parse (try (string "ab") <|> string "ac") "" "ac"
Right "ac"
Unfortunately, try
has some pretty annoying performance penalties, which means that if you want a performant parser, you will have to refactor your grammar a bit. I would do that with the above parser this way:
Text.Parsec> parse (char 'a' >> (("ab" <$ char 'b') <|> ("ac" <$ char 'c'))) "" "ac"
Right "ac"
In your case, you will need to factor out the "#" mark similarly.
Read carefully:
[The choice operator] tries the first parser, then if it fails, tries the second. If either succeeds, then it returns the value returned by that parser..
This implies that the first parser is tried first, and if it succeeds, the second parser is NOT tried at all! This means that the first parser has a higher priority, so <|>
is not commutative, in general.
A simple counterexample can be crafted with some always-succeeding parsers, e.g.
dummy :: Parser Bool
dummy = return True <|> return False
The above is equivalent to return True
: since the first succeeds, the second is immaterial.
On top of that, parsec is designed to commit early on the first branch when that has successfully consumed some input. This design sacrifices perfect nondeterminism for efficiency. Otherwise, parsec would often need to keep in memory all the file being parsed, because if a parse error happens, some new alternative needs to be tried.
For instance,
p = (char 'a' >> char 'b' >> ...) <|>
(char 'a' >> char 'c' >> ...)
will not work as one might expect. As soon as 'a'
is consumed successfully by the first branch, parsec commits on it. This means that if 'c'
follows, then the first branch will fail but it's too late for the second one to be tried. The whole parser simply fails.
To solve this one can factor out the common prefix, e.g.
p = char 'a' >> ( (char 'b' >> ...) <|> (char 'c' >> ...) )
or resort to try
,
p = (try (char 'a' >> char 'b') >> ...) <|>
(char 'a' >> char 'c' >> ...)
try
basically tells parsec not to commit to the branch until the whole parser under try
succeeds. If abused, it can cause parsec to keep in memory a large portion of the file, but the user at least has some control on that. Theoretically, perfect nondeterminism can be recovered by always adding try
to the whole left branch of <|>
. This however is not recommended, since it prods the programmer into writing an inefficient parser, both because of the memory issue and the fact that one can easily write a grammar requiring an exponential number of backtracks to successfully parse.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With