Where should I draw the line between lexer and parser?

I'm writing a lexer for the IMAP protocol for educational purposes and I'm stumped as to where I should draw the line between lexer and parser. Take this example of an IMAP server response:

* FLAGS (\Answered \Deleted)

This response is defined in the formal syntax like this:

mailbox-data   = "FLAGS" SP flag-list
flag-list      = "(" [flag *(SP flag)] ")"
flag           = "\Answered" / "\Deleted"

Since they are specified as string literals (aka "terminal" tokens) would it be more correct for the lexer to emit a unique token for each, like:

(TknAnsweredFlag)
(TknSpace)
(TknDeletedFlag)

Or would it be just as correct to emit something like this:

(TknBackSlash)
(TknString "Answered")
(TknSpace)
(TknBackSlash)
(TknString "Deleted")

My confusion is that the former method could overcomplicate the lexer - if \Answered had two meanings in two different contexts the lexer wouldn't emit the right token. As a contrived example (this situation won't occur because e-mail addresses are enclosed in quotes), how would the lexer deal with an e-mail address like \[email protected]? Or is the formal syntax designed to never allow such an ambiguity to arise?

asked Mar 19 '11 by duck9

2 Answers

As a general rule, you don't want lexical syntax to propagate into the grammar, because it's just detail. For instance, a lexer for a programming language like C would certainly recognize numbers, but it is generally inappropriate to produce HEXNUMBER and DECIMALNUMBER tokens, because that distinction isn't important to the grammar.

I think what you want are the most abstract tokens that allow your grammar to distinguish the cases that matter for your purpose. You can mitigate confusion caused in one part of the grammar by the choices you make in other parts.

If your goal is simply to read past the flag values, then in fact you don't need to distinguish among them, and a TknFlag with no associated content would be good enough.

If your goal is to process the flag values individually, you need to know whether you got an ANSWERED and/or a DELETED indication. How they are lexically spelled is irrelevant, so I'd go with your TknAnsweredFlag solution. I would dump the TknSpace, because in any sequence of flags there must be intervening spaces (your spec says so), so I'd try to eliminate it using whatever whitespace-suppression machinery your lexer offers.
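As a rough sketch of that approach (token names are illustrative, not part of any IMAP library), a lexer can emit one distinct token per flag keyword and suppress whitespace so it never reaches the parser:

```python
import re

# One distinct token per flag keyword; whitespace is matched but never emitted.
TOKEN_SPEC = [
    ("TknAnsweredFlag", r"\\Answered"),
    ("TknDeletedFlag",  r"\\Deleted"),
    ("TknLParen",       r"\("),
    ("TknRParen",       r"\)"),
    ("FLAGS",           r"FLAGS"),
    ("STAR",            r"\*"),
    ("SKIP",            r"[ \t]+"),  # whitespace suppression
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(line):
    for m in MASTER.finditer(line):
        if m.lastgroup != "SKIP":
            yield m.lastgroup

print(list(tokenize(r"* FLAGS (\Answered \Deleted)")))
# -> ['STAR', 'FLAGS', 'TknLParen', 'TknAnsweredFlag', 'TknDeletedFlag', 'TknRParen']
```

Note that the parser never sees a space token: the grammar rule for flag-list can simply be `"(" [flag *(flag)] ")"` once spacing is handled in the lexer.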

On occasion, I run into situations where there are dozens of such flag-like things. Then your grammar starts to become cluttered if you have a token for each. If the grammar doesn't need to know specific flags, you should have a TknFlag with an associated string value. If a small subset of the flags is needed by the grammar but most of them are not, then compromise: have individual tokens for the flags that matter to the grammar, and a catch-all TknFlag with an associated string for the rest.
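That compromise can be sketched as follows (a minimal illustration, assuming only \Answered and \Deleted matter to the grammar; names are hypothetical):

```python
import re

# Flags the grammar cares about get dedicated tokens; any other flag-like
# word becomes a generic TknFlag that carries its spelling as a value.
SPECIAL = {r"\Answered": "TknAnsweredFlag", r"\Deleted": "TknDeletedFlag"}
FLAG_RE = re.compile(r"\\[A-Za-z]+")

def lex_flags(text):
    for m in FLAG_RE.finditer(text):
        word = m.group()
        yield (SPECIAL[word], None) if word in SPECIAL else ("TknFlag", word)

print(list(lex_flags(r"(\Answered \Seen \Draft \Deleted)")))
# -> [('TknAnsweredFlag', None), ('TknFlag', '\\Seen'),
#     ('TknFlag', '\\Draft'), ('TknDeletedFlag', None)]
```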

Regarding the difficulty of having two different interpretations: this is one of those tradeoffs. If you have that issue, then your tokens need fine enough detail in both places where they are used in the grammar so that you can discriminate. If "\" is relevant as a token somewhere else in the grammar, you certainly could produce both TknBackSlash and TknAnswered.

However, if the way something is treated in one part of the grammar differs from another, you can often get around this using a mode-driven lexer. Think of modes as states in a finite state machine, each with an associated (sub)lexer. Transitions between modes are triggered by tokens that act as cues (you must have a FLAGS token; it is precisely such a cue that you are about to pick up flag values). In a mode, you can produce tokens that other modes would not produce; thus in one mode you might produce "\" tokens, but in your flag mode you wouldn't need to. Mode support is pretty common in lexers because this problem is more common than you might expect. See the Flex documentation (start conditions) for an example.
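A toy mode-driven lexer along those lines might look like this (mode names and token names are illustrative; in a default mode "\" would be its own token, while the flag mode, entered on the FLAGS cue, lexes "\Word" as a single flag token):

```python
import re

def tokenize(text):
    # Each mode has its own rule list; the FLAGS token switches modes.
    rules = {
        "DEFAULT": [("FLAGS", r"FLAGS"), ("TknBackSlash", r"\\"),
                    ("WORD", r"[A-Za-z]+"), ("SKIP", r"[ *]+")],
        "FLAG":    [("TknFlag", r"\\[A-Za-z]+"), ("SKIP", r"[ (]+"),
                    ("END", r"\)")],
    }
    mode, pos = "DEFAULT", 0
    while pos < len(text):
        for name, pat in rules[mode]:
            m = re.compile(pat).match(text, pos)
            if m:
                if name == "FLAGS":
                    yield ("FLAGS", m.group())
                    mode = "FLAG"          # the cue token triggers the transition
                elif name == "END":
                    yield ("TknRParen", m.group())
                    mode = "DEFAULT"
                elif name != "SKIP":
                    yield (name, m.group())
                pos = m.end()
                break
        else:
            raise SyntaxError(f"unexpected character at {pos}")

print(list(tokenize(r"* FLAGS (\Answered \Deleted)")))
# -> [('FLAGS', 'FLAGS'), ('TknFlag', '\\Answered'),
#     ('TknFlag', '\\Deleted'), ('TknRParen', ')')]
```

Production lexers (Flex start conditions, ANTLR lexer modes) provide this machinery directly rather than hand-rolling the state switch.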

The fact that you are asking the question shows you are on the right track to making a good choice. You need to balance the maintainability goal of minimizing tokens (technically you could parse using a token for every ASCII character!) against the fundamental requirement to discriminate well enough for your needs. After you've built a dozen grammars this tradeoff will seem easy, but I think the rules of thumb I've provided are pretty good.

answered Sep 26 '22 by Ira Baxter


I'd come up with the CFG first; whatever terminals it needs to do its job are what the lexer should recognize. Otherwise you are just guessing at the proper way to tokenize the string.

answered Sep 22 '22 by Matt Wonlaw