
Design guidelines for parser and lexer?

I'm writing a lexer (with re2c) and a parser (with Lemon) for a slightly convoluted data format: it is CSV-like, but with specific string types at specific positions (alphanumeric characters only; alphanumeric characters and minus signs; any character except quotes and commas, but with balanced braces; and so on), strings enclosed in braces, and strings that look like function calls, with opening and closing braces that can contain parameters.
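To make that concrete, a grammar for such a format might look roughly like this (the rule and token names are invented for illustration; my real format differs in the details):

```
record     ::= field ("," field)*
field      ::= WORD              ; alphanumeric only
             | DASHED_WORD       ; alphanumeric plus minus signs
             | "{" brace_text "}"
             | call
call       ::= WORD "{" arg_list "}"
arg_list   ::= field ("," field)*
```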

My first shot at it was a lexer with many states, each state catering to a specific string format. But after many unhelpful "unexpected input" messages from the lexer (which had grown very big), I realized that it was probably trying to do the parser's work. I scrapped my first try and went with a lexer that has only one state and many character tokens, plus a parser that combines the tokens into the different string types. This works better: I get more helpful syntax errors from the parser when something is off. But it still doesn't feel quite right. I am now thinking of adding one or two states back to the lexer, but having the parser initiate them, since the parser has a much better overview of which string type is required at a given point. Overall I feel a bit stupid :(
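A minimal sketch of what I mean by "the parser initiates the lexer states" (in Python rather than re2c/Lemon, and with invented mode names), where the lexer exposes named modes and the caller picks the mode matching the string type it expects next:

```python
import re

# Each mode knows how to scan exactly one of the string types.
# Mode names and patterns are made up for illustration.
MODES = {
    "alnum":      re.compile(r"[A-Za-z0-9]+"),
    "alnum_dash": re.compile(r"[A-Za-z0-9-]+"),
}

class Lexer:
    def __init__(self, text):
        self.text, self.pos = text, 0

    def next_token(self, mode):
        # The caller (the parser) decides which mode applies here.
        m = MODES[mode].match(self.text, self.pos)
        if m is None:
            raise SyntaxError(f"expected a {mode} string at offset {self.pos}")
        self.pos = m.end()
        return m.group(0)

lex = Lexer("abc,x-y")
first = lex.next_token("alnum")        # parser knows field 1 is alphanumeric
lex.pos += 1                           # skip the comma (kept crude on purpose)
second = lex.next_token("alnum_dash")  # parser knows field 2 allows '-'
```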

I have no formal CS background and shy away a bit from the math-heavy theory. But maybe there is a tutorial or book somewhere that explains what a lexer should (and should not) do and which part of the work belongs to the parser: how to construct good token patterns, when to use lexer states, when and how to use recursive rules (with an LALR parser), and how to avoid ambiguous rules. A pragmatic cookbook that teaches the basics, in short. The "Lex and YACC primer/HOWTO" was nice, but not enough. Since I just want to parse a data format, books on compiler construction (like the red dragon book) look a bit oversized to me.

Or maybe someone can give me some plain rules here.

chiborg asked Jul 07 '10

1 Answer

What you should really do is write a grammar for your language. Once you have that, the boundary is easy:

  • The lexer is responsible for taking your input and telling you which terminal you have.
  • The parser is responsible for matching a series of terminals and nonterminals to a production rule, repeatedly, until you either have an Abstract Syntax Tree (AST) or a parse failure.

The lexer is not responsible for input validation, except insofar as it rejects characters that cannot appear in any token, and other very basic checks. The parser does all the rest.
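Here is a toy version of that division of labor (token names and the miniature grammar are invented, not your actual format): the lexer only classifies characters into terminals, and the parser alone decides whether the sequence of terminals forms a valid record — including structural errors such as unbalanced braces.

```python
import re

# Terminals the lexer knows about; it validates nothing beyond this.
TOKEN_SPEC = [
    ("WORD",   re.compile(r"[A-Za-z0-9-]+")),
    ("LBRACE", re.compile(r"\{")),
    ("RBRACE", re.compile(r"\}")),
    ("COMMA",  re.compile(r",")),
]

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        for name, pattern in TOKEN_SPEC:
            m = pattern.match(text, pos)
            if m:
                tokens.append((name, m.group(0)))
                pos = m.end()
                break
        else:
            # The only thing the lexer rejects: a character no token allows.
            raise SyntaxError(f"impossible character {text[pos]!r} at {pos}")
    return tokens

def parse_record(tokens):
    # record ::= field ("," field)* ; field ::= WORD | "{" WORD "}"
    pos = 0
    fields = []

    def field():
        nonlocal pos
        if pos < len(tokens) and tokens[pos][0] == "WORD":
            value = tokens[pos][1]; pos += 1
            return value
        if pos < len(tokens) and tokens[pos][0] == "LBRACE":
            pos += 1
            if pos >= len(tokens) or tokens[pos][0] != "WORD":
                raise SyntaxError("expected a word inside braces")
            value = tokens[pos][1]; pos += 1
            if pos >= len(tokens) or tokens[pos][0] != "RBRACE":
                raise SyntaxError("unbalanced braces")  # structure: the parser's job
            pos += 1
            return value
        raise SyntaxError("expected a field")

    fields.append(field())
    while pos < len(tokens) and tokens[pos][0] == "COMMA":
        pos += 1
        fields.append(field())
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return fields

print(parse_record(tokenize("abc,{de-f},gh")))  # → ['abc', 'de-f', 'gh']
```

Note that `tokenize` happily accepts `"abc,{de"` — only `parse_record` notices the missing closing brace, which is exactly where the helpful error message can be produced.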

Take a look at https://www.cs.rochester.edu/u/nelson/courses/csc_173/grammars/parsing.html . It's an intro CS course page on parsing.

Borealid answered Sep 20 '22