
Should I use a lexer when using a parser combinator library like Parsec?

When writing a parser in a parser combinator library like Haskell's Parsec, you usually have 2 choices:

  • Write a lexer to split your String input into tokens, then perform parsing on [Token]
  • Directly write parser combinators on String
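
For concreteness, here is a rough sketch of what the two shapes look like in Parsec. The Token type, the toy lexer, and the parser names below are invented for illustration only, not taken from any library:

    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- First approach (hypothetical): lex first, then parse [Token].
    data Token = TIdent String | TNum Double | TPlus
      deriving (Show, Eq)

    -- A toy whitespace-splitting lexer; a real one might be generated by alex.
    lexer :: String -> [Token]
    lexer = map classify . words
      where
        classify "+" = TPlus
        classify w
          | all (`elem` "0123456789.") w = TNum (read w)
          | otherwise                    = TIdent w

    -- Combinators then consume the token stream instead of characters.
    type TokenParser a = Parsec [Token] () a

    -- Second approach: combinators consume the String directly and
    -- handle whitespace themselves.
    number :: Parser Double
    number = (read <$> many1 digit) <* spaces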

The first method often seems to make sense given that many parsing inputs can be understood as tokens separated by whitespace.

Elsewhere, I have seen people recommend against tokenizing (also called scanning or lexing), with simplicity usually quoted as the main reason.

What are the general trade-offs between lexing and not lexing?

asked Mar 05 '13 by nh2

People also ask

What is the difference between a lexer and a parser?

A lexer is a software program that performs lexical analysis. ... A parser goes one level further than the lexer: it takes the tokens produced by the lexer and tries to determine if proper sentences have been formed. Parsers work at the grammatical level, lexers work at the word level.

Are parser combinators slow?

Parser combinators are generally slower than a hand-written or code-generated parser. That's somewhat innate due to the overhead of “threading” (for lack of a better word) your control flow through many function calls.

What is a parser in Haskell?

Parser combinators are known to be simple to use without requiring external tools or too many concepts to learn. That is, they are ordinary Haskell values that, being functions, can easily be combined with and returned from other parsers.


1 Answer

The most important difference is that lexing will translate your input domain.

A nice result of this is that

  • You do not have to think about whitespace anymore. In a direct (non-lexing) parser, you have to sprinkle space parsers everywhere whitespace is allowed, which is easy to forget, and it clutters your code if whitespace must separate all your tokens anyway (see the sketch after this list).

  • You can think about your input in a piece-by-piece manner, which is easy for humans.
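
To illustrate the whitespace point from the first item, here is a minimal sketch of the usual workaround in a direct String parser: wrap every token-like parser in a combinator that swallows trailing whitespace. The names lexeme, symbol, integer, and addition are conventions of this sketch, not part of Parsec itself:

    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- Each "lexeme" consumes the whitespace that follows it, so the grammar
    -- rules below never mention whitespace explicitly.
    lexeme :: Parser a -> Parser a
    lexeme p = p <* spaces

    symbol :: String -> Parser String
    symbol = lexeme . string

    integer :: Parser Integer
    integer = lexeme (read <$> many1 digit)

    -- "1 + 2" and "1+2" both parse, because every lexeme eats trailing spaces.
    addition :: Parser (Integer, Integer)
    addition = (,) <$> integer <* symbol "+" <*> integer

Forgetting a single lexeme wrapper is exactly the kind of clutter and silent mistake the first item describes.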

However, if you do perform lexing, you run into the following problems:

  • You cannot use common parsers on String anymore - e.g. for parsing a number with a library function parseFloat :: Parsec String s Float (that operates on a String input stream), you have to do something like takeNextToken :: TokenParser String and execute the parseFloat parser on it, inspecting the parse result (usually Either ErrorMessage a). This is messy to write and limits composability.

  • You have to adjust all error messages. If your parser on tokens fails at the 20th token, where in the input string is that? You will have to manually map error locations back to the input string, which is tedious (in Parsec this means adjusting all SourcePos values).

  • Error reporting is generally worse. Running string "hello" *> space *> float on wrong input like "hello4" will tell you precisely that whitespace is missing after the hello, while a lexer will just claim to have found an "invalid token" (see the sketch after this list).

  • Many things that one would expect to be atomic units, separated out by a lexer, are actually rather hard for a lexer to identify. Take string literals, for example: suddenly "hello world" is no longer the two tokens "hello and world" (but only, of course, if the quotes are not escaped, like \") - while this is very natural for a parser, it means complicated rules and special cases for a lexer.

  • You cannot re-use parsers on tokens as nicely. If you define how to parse a double out of a String, export it and the rest of the world can use it; they cannot run your (specialized) tokenizer first.

  • You are stuck with it. When you are developing the language to parse, using a lexer might lead you into making early decisions, fixing things that you might want to change afterwards. For example, imagine you defined a language that contains some Float token. At some point, you want to introduce negative literals (-3.4 and - 3.4) - this might not be possible because the lexer interprets whitespace as a token separator. Using a parser-only approach, you can stay more flexible, making changes to your language easier. This is not really surprising, since a parser is a more complex tool that inherently encodes rules.
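
To make the error-reporting item concrete, here is a small runnable version of the string "hello" *> space *> float example; float' is a naive stand-in written for this sketch, since Parsec only provides a float parser through Text.Parsec.Token:

    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- Naive stand-in for a float parser; good enough for this demonstration.
    float' :: Parser Double
    float' = read <$> many1 (digit <|> char '.')

    greeting :: Parser Double
    greeting = string "hello" *> space *> float'

    -- ghci> parse greeting "" "hello4"
    -- Left (line 1, column 6):
    -- unexpected "4"
    -- expecting space

The character-level parser points at the exact offending column, while a standalone lexer would typically only be able to report an invalid or unexpected token.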

To summarize, I would recommend writing lexer-free parsers for most cases.

In the end, a lexer is just a "dumbed-down"* parser - if you need a parser anyway, combine them into one.


* From computing theory, we know that all regular languages are also context-free languages; lexers are usually regular, parsers context-free or even context-sensitive (monadic parsers like Parsec can express context sensitivity).
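
As a closing illustration of the "combine them into one" advice: in Parsec, Text.Parsec.Token's makeTokenParser derives whitespace-aware lexeme parsers (identifiers, numbers, symbols) that still run as ordinary character-level parsers, so no separate token stream is needed. A minimal sketch, using emptyDef as the language definition and the invented names tok and assignment:

    import Text.Parsec
    import Text.Parsec.String (Parser)
    import qualified Text.Parsec.Token as P
    import Text.Parsec.Language (emptyDef)

    -- Lexeme-style parsers derived from a language definition; they skip
    -- trailing whitespace and comments but remain ordinary Parsec parsers.
    tok :: P.TokenParser ()
    tok = P.makeTokenParser emptyDef

    identifier :: Parser String
    identifier = P.identifier tok

    float :: Parser Double
    float = P.float tok

    -- Example rule: an assignment such as "pi = 3.14".
    assignment :: Parser (String, Double)
    assignment = (,) <$> identifier <* P.symbol tok "=" <*> float

Whether this still counts as lexing is a matter of taste; the point is that everything stays in one pass over the characters, which fits the recommendation above.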

answered Oct 19 '22 by nh2