
In Parsec, is there a way to prevent lexeme from consuming newlines?

Tags: haskell, parsec

All of the parsers in Text.Parsec.Token politely use lexeme to eat whitespace after a token. Unfortunately for me, whitespace includes newlines, which I want to use as expression terminators. Is there a way to convince lexeme to leave newlines alone?

John F. Miller asked Apr 15 '11 03:04



3 Answers

No, there is not. Here is the relevant code.

From Text.Parsec.Token:

lexeme p
    = do{ x <- p; whiteSpace; return x  }


--whiteSpace
whiteSpace
    | noLine && noMulti  = skipMany (simpleSpace <?> "")
    | noLine             = skipMany (simpleSpace <|> multiLineComment <?> "")
    | noMulti            = skipMany (simpleSpace <|> oneLineComment <?> "")
    | otherwise          = skipMany (simpleSpace <|> oneLineComment <|> multiLineComment <?> "")
    where
      noLine  = null (commentLine languageDef)
      noMulti = null (commentStart languageDef)

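Notice that the guards of whiteSpace differ only in which comment styles they skip. Every branch includes simpleSpace, which is defined as skipMany1 (satisfy isSpace) and therefore matches newlines. The lexeme function calls whiteSpace, and lexeme is used liberally throughout the rest of Text.Parsec.Token.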
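Since the built-in lexeme cannot be reconfigured, one workaround is to bypass Text.Parsec.Token and write a small lexeme of your own. The following is only a minimal sketch of that idea, not anything Parsec ships: the names lineSpace, lexeme', symbol', number, expr and program are made up for illustration, and the toy grammar (sums of numbers, one per line) is an assumption.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Skip spaces and tabs only; newlines stay in the input.
-- (Hypothetical helper, not part of Parsec.)
lineSpace :: Parser ()
lineSpace = skipMany (oneOf " \t")

-- A lexeme that uses lineSpace instead of Parsec's whiteSpace.
lexeme' :: Parser a -> Parser a
lexeme' p = p <* lineSpace

symbol' :: String -> Parser String
symbol' = lexeme' . try . string

number :: Parser Integer
number = lexeme' (read <$> many1 digit)

-- Toy expression: numbers separated by '+'.
expr :: Parser Integer
expr = sum <$> number `sepBy1` symbol' "+"

-- One expression per line; the newline acts as the terminator.
program :: Parser [Integer]
program = lineSpace *> (expr `endBy` lexeme' newline) <* eof
```

With these definitions, parse program "" "1 + 2\n3+4 \n" should give Right [3,7]. The cost is that you have to rebuild any token parsers you need (identifiers, string literals and so on) on top of lexeme' yourself.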


Update Sept. 28, 2015

The ultimate solution for me was to use a proper lexical analyser (alex). Parsec does a very good job as a parsing library and it is a credit to the design that it can be mangled into doing lexical analysis, but for all but small and simple projects it will quickly become unwieldy. I now use alex to create a linear set of tokens and then Parsec turns them into an AST.
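To give an idea of the second half of that pipeline, here is a rough sketch of Parsec running over a list of tokens instead of characters. The Token type below is hypothetical and merely stands in for whatever the alex-generated lexer would produce; position tracking is deliberately crude (it just bumps the column by one per token).

```haskell
import Text.Parsec

-- Hypothetical token type; a real one would come from the lexer.
data Token = TNum Integer | TPlus | TNewline
  deriving (Eq, Show)

type TokParser = Parsec [Token] ()

-- Accept a single token if the given function returns Just.
satisfyTok :: (Token -> Maybe a) -> TokParser a
satisfyTok = tokenPrim show (\pos _ _ -> incSourceColumn pos 1)

num :: TokParser Integer
num = satisfyTok $ \t -> case t of
  TNum n -> Just n
  _      -> Nothing

tok :: Token -> TokParser ()
tok t = satisfyTok (\t' -> if t == t' then Just () else Nothing)

-- A statement is a sum terminated by an explicit newline token.
statement :: TokParser Integer
statement = sum <$> (num `sepBy1` tok TPlus) <* tok TNewline

program :: TokParser [Integer]
program = many statement <* eof
```

Because the newline is now an ordinary token, there is no whitespace skipping left in the parser at all; that decision lives entirely in the lexer.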

John F. Miller answered Nov 14 '22 08:11



If newlines are your expression terminators, maybe it would make sense to split the input at each newline and parse each line on its own.
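A minimal sketch of this approach, assuming a made-up per-line grammar (whitespace-separated integers) and hypothetical names lineExpr and parseLines. The stock token parsers can be used unchanged here, because a single line no longer contains a newline for lexeme to swallow.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)
import qualified Text.Parsec.Token as P
import Text.Parsec.Language (emptyDef)

lexer :: P.TokenParser ()
lexer = P.makeTokenParser emptyDef

-- Hypothetical grammar for one line: a run of integers.
lineExpr :: Parser [Integer]
lineExpr = P.whiteSpace lexer *> many (P.integer lexer) <* eof

-- Parse every line separately; the source name records the line
-- number so error messages still point at the right place.
parseLines :: String -> [Either ParseError [Integer]]
parseLines src =
  [ parse lineExpr ("line " ++ show n) l
  | (n, l) <- zip [1 :: Int ..] (lines src)
  ]
```

Note that this only works if no construct in your language can span several lines.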

bzn answered Nov 14 '22 08:11



Well, not all parsers in Text.Parsec.Token use lexeme, although all of them should. Worse, it is not documented which of them consume white space and which do not. Some of the parsers in Text.Parsec.Token consume trailing white space, some don't, and some consume leading white space as well. If you want full control over the situation, read the existing issues on the GitHub issue tracker.

In particular:

  • decimal, hexadecimal, and octal do not consume trailing white space; see the source and this issue;

  • integer consumes leading white space as well; see this issue;

  • the rest probably consume trailing white space, and thus newlines. This is difficult to tell for sure, however, because Parsec's code is particularly hairy (IMHO) and the project has no real test suite. There are only 3 tests, which check that already-fixed bugs do not show up again. That is not enough to prevent regressions, so any change in the source may break your code in the next release of Parsec.
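Given the lack of documentation, it may be easiest to probe the behaviour of the Parsec version you actually have installed. The helper below (leftover is a hypothetical name) runs a parser and returns whatever input it left unconsumed, so you can see for yourself whether trailing white space and newlines were eaten.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)
import qualified Text.Parsec.Token as P
import Text.Parsec.Language (emptyDef)

lexer :: P.TokenParser ()
lexer = P.makeTokenParser emptyDef

-- Run a parser, then return the rest of the input untouched.
leftover :: Parser a -> String -> Either ParseError String
leftover p = parse (p *> getInput) "probe"
```

For example, compare leftover (P.decimal lexer) "42 \nrest" with leftover (P.natural lexer) "42 \nrest". If the description above holds for your version, the first should leave " \nrest" and the second just "rest".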

There are various proposals for making it configurable (what should be considered white space), but none of them has been merged or commented on, for some reason.

But the real problem lies in the design of Text.Parsec.Token, which locks the user into the solutions built by makeTokenParser. This design is particularly inflexible. In many cases the only solution is to copy the entire module and edit it as needed.

But if you want a modern and consistent Parsec, you can switch to Megaparsec, where this problem (and many others) does not exist.
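In Megaparsec the space consumer is an ordinary value that you pass to lexeme, so excluding newlines is a matter of choosing what it skips. A minimal sketch, assuming a recent Megaparsec; the names sc, lexeme, integer and statements and the toy grammar (one integer per line) are illustrative only.

```haskell
import Control.Applicative (empty)
import Control.Monad (void)
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L

type Parser = Parsec Void String

-- Space consumer: spaces and tabs only, no comments, no newlines.
sc :: Parser ()
sc = L.space (void (some (char ' ' <|> char '\t'))) empty empty

lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc

integer :: Parser Integer
integer = lexeme L.decimal

-- One integer per line; the newline is an explicit terminator.
statements :: Parser [Integer]
statements = sc *> (integer `endBy` lexeme newline) <* eof
```

Here parseMaybe statements "1 \n2\n" should return Just [1,2], and the newline is only ever consumed where the grammar asks for it.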


Disclosure: I'm one of the authors of Megaparsec.

Mark Karpov answered Nov 14 '22 09:11
