Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

why is parsecs "choice" combinator seemingly stuck on the first choice?

After looking at the CSV sample code in Real World Haskell, I've tried to build a little XML parser. But close tags error out with 'unexpected "/"' errors. Can you tell me why my "closeTag" parser doesn't work (or possibly isn't ever invoked)? Thanks!

import Text.ParserCombinators.Parsec

xmlFile = manyTill line eof
line = manyTill tag eol
eol = char '\n'

word = many1 (noneOf "></")

tag = choice [openTag, closeTag, nullTag, word]

nullTag = between (char '<') (string "/>") word
closeTag = between (string "</") (char '>') word
openTag = between (char '<') (char '>')  tagContent
attrval = between (char '"') (char '"') word

atts = do {
        (char ' ')
        ; sepBy attr (char ' ')
}

attr = do {
                word
                ; char '='
                ; attrval
        }

tagContent = do {
                w <- word
                ; option []  atts
                ; return w
        }

parseXML :: String -> Either ParseError [[String]]
parseXML input = parse xmlFile "(unknown)" input

main =
    do c <- getContents
       case parse xmlFile "(stdin)" c of
            Left e -> do putStrLn "Error parsing input:"
                         print e
            Right r -> mapM_ print r
like image 572
nont Avatar asked Apr 18 '11 02:04

nont


1 Answers

Parsec's strategy is essentially LL(1), which means that it "commits" to the current branch whenever any input is consumed. Your openTag parser consumes the < with its char '<', which means that if when it sees > instead of /, the whole parse fails instead of trying a new choice. If openTag didn't consume any input and failed, another choice would be tried. Parsec does this for efficiency (the alternative is exponential time!) and for reasonable error messages.

You have two options. The preferred option, when it is reasonable to pull off, is to factor your grammar so that all choices are made without consuming input, eg.:

tag = word <|> (char '<' >> tagbody)
    where
    tagbody = do
        content <- tagcontent
        choice [ string "/>", char '>' ]

Modulo errors and style (my brain is a bit fried at the moment :-P).

The other way, which locally changes parsec's semantics (at the expense of the aforementioned error messages and efficiency -- but it's not usually too bad because it's local), is to use the try combinator which allows a parser to consume input and still fail "softly" so another choice can be tried:

nulltag = try $ between (char '<') (string "/>") word
-- etc.

Sometimes using try is cleaner and easier than factoring like above, which can obscure the "deep structure" of the language. It's a stylistic trade-off.

like image 197
luqui Avatar answered Oct 22 '22 08:10

luqui