Using Parsec to parse regular expressions

I'm trying to learn Parsec by implementing a small regular expression parser. In BNF, my grammar looks something like:

EXP  : EXP *
     | LIT EXP
     | LIT

I've tried to implement this in Haskell as:

expr = try star
       <|> try litE
       <|> lit

litE  = do c <- noneOf "*"
           rest <- expr
           return (c : rest)

lit   = do c <- noneOf "*"
           return [c]

star = do content <- expr
          char '*'
          return (content ++ "*")

There are some infinite loops here, though (e.g. expr -> star -> expr without consuming any tokens), which make the parser loop forever. I'm not sure how to fix this, because the very nature of star is that it consumes its mandatory token at the end.

Any thoughts?

asked Jan 26 '12 15:01

Xodarap

2 Answers

You should use Text.ParserCombinators.Parsec.Expr.buildExpressionParser; it is ideal for this purpose. You simply describe your operators, their precedence and associativity, and how to parse an atom, and the combinator builds the parser for you!

You probably also want to add the ability to group terms with parens so that you can apply * to more than just a single literal.

Here's my attempt (I threw in |, +, and ? for good measure):

import Control.Applicative
import Control.Monad
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Expr

data Term = Literal Char
          | Sequence [Term]
          | Repeat (Int, Maybe Int) Term
          | Choice [Term]
  deriving ( Show )

term :: Parser Term
term = buildExpressionParser ops atom where

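  -- Operator table, listed from highest to lowest precedence.
  ops = [ -- Postfix repetition; the pair is (minimum, optional maximum) count.
          [ Postfix (Repeat (0, Nothing) <$ char '*')
          , Postfix (Repeat (1, Nothing) <$ char '+')
          , Postfix (Repeat (0, Just 1)  <$ char '?')
          ]
          -- Juxtaposition: the operator parser consumes no input.
        , [ Infix (return sequence) AssocRight
          ]
          -- Alternation binds loosest.
        , [ Infix (choice <$ char '|') AssocRight
          ]
        ]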

  atom = msum [ Literal <$> lit
              , parens term
              ]

  lit = noneOf "*+?|()"
  sequence a b = Sequence $ (seqTerms a) ++ (seqTerms b)
  choice a b = Choice $ (choiceTerms a) ++ (choiceTerms b)
  parens = between (char '(') (char ')')

  seqTerms (Sequence ts) = ts
  seqTerms t = [t]

  choiceTerms (Choice ts) = ts
  choiceTerms t = [t]

main = parseTest term "he(llo)*|wor+ld?"
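
For comparison, here is a rough hand-written sketch of the same layering for just literals, parentheses, and *. It is not part of the answer above: the Regex type and the names regex, piece, and atom are made up for illustration. The point is only that putting the star after the atom removes the left recursion from the question's grammar.

```haskell
import Text.ParserCombinators.Parsec

-- Hypothetical AST for this sketch, smaller than the Term type above.
data Regex = Lit Char
           | Group [Regex]
           | Many Regex
  deriving (Show, Eq)

-- regex := piece*
regex :: Parser [Regex]
regex = many piece

-- piece := atom '*'*
-- The atom is parsed first, so input is always consumed before any '*'.
piece :: Parser Regex
piece = do
  a     <- atom
  stars <- many (char '*')
  -- Wrap the atom in one Many per trailing star.
  return (foldl (\r _ -> Many r) a stars)

-- atom := literal | '(' regex ')'
atom :: Parser Regex
atom = (Lit <$> noneOf "*()")
   <|> (Group <$> between (char '(') (char ')') regex)
```

Running parse (regex <* eof) "" "a(bc)*" should give Right [Lit 'a',Many (Group [Lit 'b',Lit 'c'])]. Once alternation and more postfix operators are needed, the table-driven version above is probably less work to maintain.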

answered Nov 03 '22 17:11

pat

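Your grammar is left-recursive, which doesn’t play nice with Parsec: expr tries star, which immediately calls expr again before consuming any input, so the parser recurses forever, and try cannot help because there is never a failure to backtrack from. There are a few ways around this. Probably the simplest is just making the * optional in another rule: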

lit :: Parser (Char, Maybe Char)
lit = do
  c <- noneOf "*"
  s <- optionMaybe $ char '*'
  return (c, s)
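
As a quick check of the rule above, here is a small sketch that runs lit repeatedly over a whole input. The helper name parseLits and the source name "regex" are made up for this example; everything else is standard Parsec.

```haskell
import Text.ParserCombinators.Parsec

lit :: Parser (Char, Maybe Char)
lit = do
  c <- noneOf "*"              -- always consumes one character first
  s <- optionMaybe (char '*')  -- then an optional trailing star
  return (c, s)

-- Hypothetical helper: parse a whole string as a run of literals.
-- eof makes leftover input (such as a leading '*') a parse error.
parseLits :: String -> Either ParseError [(Char, Maybe Char)]
parseLits = parse (many lit <* eof) "regex"
```

With this, parseLits "ab*c" gives Right [('a',Nothing),('b',Just '*'),('c',Nothing)], while parseLits "*a" is a Left, because a star with nothing in front of it has no literal to attach to.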

Of course, you’ll probably end up wrapping things in a data type anyway, and there are a lot of ways to go about it. Here’s one, off the top of my head:

import Control.Applicative ((<$>))
import Data.Maybe (isNothing)
import Text.ParserCombinators.Parsec

data Term = Literal Char
          | Sequence [Term]
          | Star Term
  deriving (Show)

expr :: Parser Term
expr = Sequence <$> many term

term :: Parser Term
term = do
  c <- noneOf "*"             -- a literal is any character other than '*'
  s <- optionMaybe $ char '*' -- Easily extended for +, ?, etc.
  return $ if isNothing s
    then Literal c
    else Star $ Literal c
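
To illustrate the "easily extended" remark, here is one possible sketch that also handles + and ?. The Plus and Optional constructors are assumptions added for this example and are not part of the Term type above; the rest follows the same shape as term.

```haskell
import Text.ParserCombinators.Parsec

-- Term extended with two hypothetical constructors, Plus and Optional.
data Term = Literal Char
          | Sequence [Term]
          | Star Term
          | Plus Term
          | Optional Term
  deriving (Show, Eq)

expr :: Parser Term
expr = Sequence <$> many term

term :: Parser Term
term = do
  c  <- noneOf "*+?"               -- a literal is anything but an operator
  op <- optionMaybe (oneOf "*+?")  -- at most one postfix operator
  return $ case op of
    Just '*' -> Star (Literal c)
    Just '+' -> Plus (Literal c)
    Just '?' -> Optional (Literal c)
    _        -> Literal c
```

For instance, parse (expr <* eof) "" "ab*c+" should give Right (Sequence [Literal 'a',Star (Literal 'b'),Plus (Literal 'c')]). Like the version above, this only attaches an operator to a single preceding character; grouping would need parentheses.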

Maybe a more experienced Haskeller will come along with a better solution.

answered Nov 03 '22 17:11

Jon Purdy

Tags: parsing, haskell, grammar, context-free-grammar, parsec