Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does it seem that the Parsec Choice operator depends on order of the parsers? [duplicate]

Tags:

haskell

parsec

I am trying to parse a very simple language that consists of only decimal or binary numbers. For example, here are some valid inputs:

#b1
#d1
#b0101
#d1234

I am having a problem using Parsec's choice operator: <|>. According to the tutorial: Write yourself a Scheme in 48 hours:

[The choice operator] tries the first parser, then if it fails, tries the second. If either succeeds, then it returns the value returned by that parser..

But in my experience, I see that the order of the parsers supplied matters. Here is my program:

import System.Environment
import Text.ParserCombinators.Parsec

main :: IO ()
main = do
  (x:_) <- getArgs 
  putStrLn ( "Hello, " ++ readExp x)

bin :: Parser String
bin = do string "#b"
         x <- many( oneOf "01")
         return x

dec :: Parser String
dec = do string "#d"
         x <- many( oneOf "0123456789")
         return x

-- Why does order matter here?
parseExp = (bin <|> dec) 

readExp :: String -> String
readExp input = case parse parseExp "test" input of
                      Left error -> "Error: " ++ show error
                      Right val  -> "Found val" ++ show val 

Here is how I am running the program:

Installing dependencies

$ cabal sandbox init
$ cabal install parsec

Compiling

$ cabal exec ghc Main

Run

$ ./Main "#d1"
Hello, Error: "test" (line 1, column 1):
unexpected "d"
expecting "#b"

$ ./Main "#b1"
Hello, Found val"1"

If I change the order of the parsers as follows:

parseExp = (dec <|> bin) 

then only binary numbers are detected and the program fails to identify the decimal numbers.

With the tests that I have performed, I see this problem only happens when one of the parsers has started parsing an input e.g. if a hash character # is found, the bin parser is activated ending up in failing as the next character expected is b and not d. It seems like there should be some kind of backtracking that should happen, which I am not aware of.

Appreciate the help!

like image 985
mandark Avatar asked Oct 10 '15 18:10

mandark


2 Answers

Parsec has two kinds of "failure": there are failures that consume input, and failures that don't. To avoid backtracking (and therefore holding onto inputs longer than necessary/being generally unfriendly to the garbage collector), (<|>) commits to the first parser as soon as it consumes input; so that if its first argument consumes input and fails, its second parser never gets a chance to succeed. You can explicitly request backtracking behavior with try, thus:

Text.Parsec> parse (string "ab" <|> string "ac") "" "ac"
Left (line 1, column 1):
unexpected "c"
expecting "ab"
Text.Parsec> parse (try (string "ab") <|> string "ac") "" "ac"
Right "ac"

Unfortunately, try has some pretty annoying performance penalties, which means that if you want a performant parser, you will have to refactor your grammar a bit. I would do that with the above parser this way:

Text.Parsec> parse (char 'a' >> (("ab" <$ char 'b') <|> ("ac" <$ char 'c'))) "" "ac"
Right "ac"

In your case, you will need to factor out the "#" mark similarly.

like image 57
Daniel Wagner Avatar answered Sep 21 '22 12:09

Daniel Wagner


Read carefully:

[The choice operator] tries the first parser, then if it fails, tries the second. If either succeeds, then it returns the value returned by that parser..

This implies that the first parser is tried first, and if it succeeds, the second parser is NOT tried at all! This means that the first parser has a higher priority, so <|> is not commutative, in general.

A simple counterexample can be crafted with some always-succeeding parsers, e.g.

dummy :: Parser Bool
dummy = return True <|> return False

The above is equivalent to return True: since the first succeeds, the second is immaterial.


On top of that, parsec is designed to commit early on the first branch when that has successfully consumed some input. This design sacrifices perfect nondeterminism for efficiency. Otherwise, parsec would often need to keep in memory all the file being parsed, because if a parse error happens, some new alternative needs to be tried.

For instance,

p = (char 'a' >> char 'b' >> ...) <|>
    (char 'a' >> char 'c' >> ...)

will not work as one might expect. As soon as 'a' is consumed successfully by the first branch, parsec commits on it. This means that if 'c' follows, then the first branch will fail but it's too late for the second one to be tried. The whole parser simply fails.

To solve this one can factor out the common prefix, e.g.

p = char 'a' >> ( (char 'b' >> ...) <|> (char 'c' >> ...) )

or resort to try,

p = (try (char 'a' >> char 'b') >> ...) <|>
    (char 'a' >> char 'c' >> ...)

try basically tells parsec not to commit to the branch until the whole parser under try succeeds. If abused, it can cause parsec to keep in memory a large portion of the file, but the user at least has some control on that. Theoretically, perfect nondeterminism can be recovered by always adding try to the whole left branch of <|>. This however is not recommended, since it prods the programmer into writing an inefficient parser, both because of the memory issue and the fact that one can easily write a grammar requiring an exponential number of backtracks to successfully parse.

like image 34
chi Avatar answered Sep 20 '22 12:09

chi