
What's the cleanest way to do case-insensitive parsing with Text.ParserCombinators.Parsec?

Tags: haskell, parsec

I'm writing my first program with Parsec. I want to parse MySQL schema dumps and would like a nice way to parse certain keywords in a case-insensitive fashion. Here is some code showing the approach I'm using to parse "CREATE" or "create". Is there a better way to do this? An answer that doesn't resort to buildExpressionParser would be best. I'm taking baby steps here.

  p_create_t :: GenParser Char st Statement
  p_create_t = do
      x <- (string "CREATE" <|> string "create")
      xs <- manyTill anyChar (char ';')
      return $ CreateTable (x ++ xs) []  -- refine later

asked Oct 17 '12 15:10 by dan


3 Answers

You can build the case-insensitive parser out of character parsers.

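import Data.Char (toLower, toUpper)
import Text.ParserCombinators.Parsec

-- Match the lowercase or uppercase form of 'c'
caseInsensitiveChar :: Char -> GenParser Char st Char
caseInsensitiveChar c = char (toLower c) <|> char (toUpper c)

-- Match the string 's', accepting either lowercase or uppercase form of each character
caseInsensitiveString :: String -> GenParser Char st String
caseInsensitiveString s = try (mapM caseInsensitiveChar s) <?> ("\"" ++ s ++ "\"")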
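
As a rough illustration of how this could slot into the question's parser, here is a minimal sketch. The Statement type is a hypothetical stand-in, since the question does not show its definition, and the two combinators are repeated so the snippet loads on its own.

```haskell
import Data.Char (toLower, toUpper)
import Text.ParserCombinators.Parsec

-- Hypothetical stand-in for the asker's Statement type (not shown in the question).
data Statement = CreateTable String [String]
  deriving (Show, Eq)

-- Match either the lowercase or the uppercase form of a single character.
caseInsensitiveChar :: Char -> GenParser Char st Char
caseInsensitiveChar c = char (toLower c) <|> char (toUpper c)

-- Match a whole keyword regardless of case; 'try' backtracks on a partial match.
caseInsensitiveString :: String -> GenParser Char st String
caseInsensitiveString s = try (mapM caseInsensitiveChar s) <?> ("\"" ++ s ++ "\"")

-- The question's parser, with the two hard-coded spellings replaced.
p_create_t :: GenParser Char st Statement
p_create_t = do
    x  <- caseInsensitiveString "create"
    xs <- manyTill anyChar (char ';')
    return $ CreateTable (x ++ xs) []  -- refine later
```

In GHCi, parse p_create_t "" "Create TABLE t (id INT);" should give Right (CreateTable "Create TABLE t (id INT)" []), with the keyword kept in the case it was written in.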

answered Nov 04 '22 19:11 by Heatsink


Repeating what I said in a comment, as it was apparently helpful:

The sledgehammer solution here is to map toLower over the entire input before running the parser, then do all your keyword matching in lowercase.

This presents obvious difficulties if you're parsing something that needs to be case-insensitive in some places and case-sensitive in others, or if you care about preserving case for cosmetic reasons. For example, although HTML tags are case-insensitive, converting an entire webpage to lowercase while parsing it would probably be undesirable. Even when compiling a case-insensitive programming language, converting identifiers could be annoying, as any resulting error messages would not match what the programmer wrote.
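
The lowercase-everything approach can be sketched roughly as follows. The names createStatement and parseLowercased are hypothetical, chosen only for this illustration.

```haskell
import Data.Char (toLower)
import Text.ParserCombinators.Parsec

-- Parse the keyword followed by the rest of the statement, up to ';'.
-- Only the lowercase spelling is needed because the input is lowercased first.
createStatement :: GenParser Char st String
createStatement = do
    kw   <- string "create"
    rest <- manyTill anyChar (char ';')
    return (kw ++ rest)

-- Hypothetical helper: lowercase the whole input, then run the given parser.
parseLowercased :: GenParser Char () a -> String -> Either ParseError a
parseLowercased p = parse p "(input)" . map toLower
```

For example, parseLowercased createStatement "CREATE TABLE Users (Id INT);" yields Right "create table users (id int)". The original casing of Users and Id is gone, which is exactly the drawback described above.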

answered Nov 04 '22 17:11 by C. A. McCann


No, Parsec cannot do that in a clean way. string is implemented on top of the primitive tokens combinator, which is hard-coded to use the equality test (==). Parsing a single case-insensitive character is a bit simpler, but you probably want more than that.
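
For the single-character case, one possible sketch in plain Parsec uses satisfy with a comparison on lowercased characters. The name charCI is hypothetical and is not part of Parsec.

```haskell
import Data.Char (toLower)
import Text.ParserCombinators.Parsec

-- Hypothetical helper: accept a character that equals 'c' once both are lowercased.
-- The label gives error messages something to report as "expecting".
charCI :: Char -> GenParser Char st Char
charCI c = satisfy (\x -> toLower x == toLower c) <?> show c
```

This returns the character as it appeared in the input, so parse (charCI 'a') "" "A" gives Right 'A'.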

There is, however, a modern fork of Parsec called Megaparsec, which has built-in solutions for everything you may want:

λ> parseTest (char' 'a') "b"
parse error at line 1, column 1:
unexpected 'b'
expecting 'A' or 'a'
λ> parseTest (string' "foo") "Foo"
"Foo"
λ> parseTest (string' "foo") "FOO"
"FOO"
λ> parseTest (string' "foo") "fo!"
parse error at line 1, column 1:
unexpected "fo!"
expecting "foo"

Note the last error message: it is better than what you get from parsing characters one by one, which is especially useful in your particular case. string' is implemented just like Parsec's string but compares characters case-insensitively. There are also oneOf' and noneOf', which may be helpful in some cases.


Disclosure: I'm one of the authors of Megaparsec.

answered Nov 04 '22 17:11 by Mark Karpov