Using Haskell Parsec to parse a regular expression in one pass

Question

Initial explanation

I am trying to do some testing with a custom regular expression engine but I got tired of writing the NFAs out by hand so I tried to make a parser with little success. Usually when people parse a regex, they create multiple intermediate structures that are eventually converted into the final machine. For my simple NFA definition, I believe the parsing could actually be done in one pass, though I have not yet determined either (a) why it actually can't or (b) how to do it, though my parser can parse very simple statements.

The (simplified) state entities are defined like so [1]:

type Tag = Int

data State a =
    Literal Tag a (State a)
  | Split (State a) (State a)
  | OpenGroup Tag (State a)
  | CloseGroup Tag (State a)
  | Accept                     -- end of expression like "abc"
  | Final                      -- end of expression like "abc$"

The tags are to allow a Show and Eq instance even though the final NFA can contain cycles. For example, to model the expression

-- "^a+(b*)c$"

I can use [2]

c = Literal 3 'c' $ Final 1
b = OpenGroup 1 $ Literal 2 'b' bs
bs = Split b $ CloseGroup 1 c
expr = Literal 1 'a' $ Split expr bs

I made a functioning stack machine-based parser for this grammar (sans the group tags) by porting a C implementation of the Thompson NFA implementation to Haskell but this requires two passes [3] to build and would require a third to end up in the structure described above.

So to build this structure via Parsec, I read the entry on this site about recursively building a List-like structure and came up with the following:

import           Control.Applicative
import           Text.Parsec         hiding (many, optional, (<|>))
import           Text.ExpressionEngine.Types
import qualified Text.ExpressionEngine.Types as T

type ParserState = Int
type ExpParser = Parsec String ParserState
type ExpParserS a = ExpParser (T.State a)

parseExpression :: String -> T.State Char
parseExpression e = case runParser p 1 e e of
  Left err -> error $ show err
  Right r -> r
where
  p = p_rec_many p_char $ p_end 1

p_rec_many :: ExpParser (T.State a -> T.State a) -> ExpParserS a -> ExpParserS a
p_rec_many p e = many'
  where
    many' = p_some <|> e
    p_some = p <*> many'

p_end :: Int -> ExpParserS a
p_end n = (Final n <$ char '$') <|> (Accept n <$ eof)

step_index :: ExpParser Int
step_index = do
  index <- getState
  updateState succ
  return index

p_char = do
  c <- noneOf "^.[$()|*+?{\"
  i <- step_index
  return $ Literal i c

And this much is sufficient to parse strings like "ab" and "abc$" [4].

The problem

The problem comes when I get to the next step: parsing the '|' for or statements. The way this should work is, a string like:

-- "(a|b|c)$"

should create the following structure:

final = Final 1
c = Literal 3 'c' final
b = Split (Literal 2 'b' final) c
a = Split (Literal 1 'a' final) b

So that means the parser that will build or statements must take the alternative expression that comes after it and pass it to all branches (I don't believe that changing Split to take a list instead changes anything because each entry still has to receive the same following expression). My attempt was:

p_regex :: T.State Char -> ExpParser (T.State Char)
p_regex e = do
  first <- p_rec_many p_char $ pure e
  (Split first <$> (char '|' *> p_regex e)) <|> return first

And the main parser changes to:

parseExpression :: String -> T.State Char
parseExpression e = case runParser p 1 e e of
  Left err -> error $ show err
  Right r -> r
  where
    p = p_regex <*> p_end 1

But this fails to type check [5]. I would expect this to be correct because p_regex has to have a built (State a) object fed to it and building "Literal" chains with p_rec_many also seems to work this way.

Perhaps I should be using buildExpressionTable? That might help with this particular issue because I could make ('$' <|> eof) the highest priority. I started an attempt at it but I can't imagine how I will handle things like the star plus, and question mark operators since those all have to reference themselves.

(EDIT: I tried with buildExpressionTable again and it appears to me that it would be far too simplistic for what I want to do. It can't natively handle stacked postfix operators [e.g. "a?*"], and my plan of making "'$' <|> eof" the highest priority won't work either because that will only attach to the last parsed 'term', not the entire string. Even if I could make it so, the '$' operator would be applied backwards: it should be the very last parsed term and fed to the preceding term. The more I work with this the more I wonder if I shouldn't just reverse the expression string before parsing it.)

Question

So, what am I doing wrong? I'm sure there is a way to do what I'm trying to do I just haven't been able to figure it out so far. Thank you for your time.

Footnotes

[1] If you want to see exactly what I'm actually using it can be found here.

[2] The idea for the Open/CloseGroup tags is to track group matches while the NFA runs. The placement in the listed expression may not be exactly right, but this way would work properly if encountered CloseGroup tags only create a capture group if a corresponding OpenGroup is found (i.e. in the above example, we would only create a capture if at least one 'b' was seen).

All other tag construction is correct and I've tested that this NFA matches strings as expected.

[3] The Thompson implementation is described here and my port of it can be seen here. This builds the NFA subset perfectly but in the resulting structure every next State will be wrapped in a Just. This is because I use Nothing to represent dangling pointers and a later step will patch in the correct next step. I could convert this structure into the one listed above by converting all the (Just state) entries into (state) entries but that would be a third pass. This implementation already requires a first pass to convert the regex into postfix notation.

[4] Resulting in

Literal 1 'a' (Literal 2 'b' (Accept 1))

and

Literal 1 'a' (Literal 2 'b' (Literal 3 'c' (Final 1)))

respectively.

[5]

Couldn't match expected type `a0 -> b0'
        with actual type `ExpParser (T.State Char)'
Expected type: T.State Char -> a0 -> b0
Actual type: T.State Char -> ExpParser (T.State Char)
In the first argument of `(<*>)', namely `p_regex'
In the expression: p_regex <*> p_end 1

MathematicalOrchid · Accepted Answer

You're probably not getting many answers because this is a huge question which requires reading a huge paper before anyone can think about writing an answer.

Having said that, by a freak coincidence, I just happen to be trying to build NFAs from regexes myself this week. ;-)

OK, so the immediate problem is

Couldn't match expected type `x -> y` with actual type `Parser`.

In English, this means that somewhere you have a function instead of a parser. A quick glance at your code shows that you've written

where
  p = p_regex <*> p_end 1

But p_regex takes 1 argument, and you haven't supplied one. That is why your code does not type-check.

OK, so stepping back, what's your actual problem? You want to parse a regex into an NFA, but the paper wants you to transform the regex into postfix notation, then parse it, then build the NFA?

It looks like it ought to be possible. When I implemented this, I had parsing and NFA generation as separate steps, purely so I can check that the parser works and that the NFA generation works separately. But it sounds like it ought to be possible. Parsec lets you have user state, so you can use that as a stack to store NFA fragments. (Or explicitly pass it around, if you prefer.)

If you want a more exact answer, you're probably going to need to trim this down to a smaller, more focused question.

Jason · Answer

Ok, so the question was basically: given a recursive data structure (defined in the question) how can I create a parser that will build my expression in one pass. My initial attempt was kind of "applicative" in nature. I was able to build up the recursive structure provided there was no conditional branching. But for regular expression parsing, branching is required which is why my approach wouldn't work for or statements.

So to solve this, I needed to have some state. A nice way to carry state in a functional language is with partially applied functions. I already had a base for this in that the signature for p_char above is:

p_char :: ExpParser (T.State Char -> T.State Char)

So what I need to combine them are combinators that compose multiple (T.State Char -> T.State Char) functions together. So with this insight, sequencing becomes:

p_many1 :: ExpParser (T.State Char -> T.State Char) -> ExpParser (T.State Char -> T.State Char)
p_many1 p = do
    f <- p
    (p_many1 p >>= return . (f .)) <|> return f

Now for the or statement what we need is something that takes an expression like "a|b|c" and create a function like:

\e -> Split (Literal 1 'a' e) (Split (Literal 2 'b' e) (Literal 3 'c' e))

So to do that we can use this:

p_splitBy1 :: ExpParser (T.State Char -> T.State Char) -> Parsec String ParserState Char -> ExpParser (T.State Char -> T.State Char)
p_splitBy1 p sep = do
    f <- p
    (sep >> p_splitBy1 p sep >>= return . (\f' e -> Split (f e) (f' e))) <|> return f

And that does indeed create the structure I need. So if anyone else encounters a similar problem in the future, maybe this question/answer can be of use.

Using Haskell Parsec to parse a regular expression in one pass

Tags:

regex

haskell

parsec

Jason

2 Answers

MathematicalOrchid

Jason

Recent Activity

Donate For Us

Using Haskell Parsec to parse a regular expression in one pass

Tags:

regex

haskell

parsec

Jason

2 Answers

MathematicalOrchid

Jason

Related questions

Recent Activity

Donate For Us