Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does the recognition of numbers belong in the scanner or in the parser?

When you look at the EBNF description of a language, you often see a definition for integers and real numbers:

integer  ::= digit digit*   // Accepts numbers with a 0 prefix
real     ::= integer "." integer (('e'|'E') integer)?

(Definitions were made on the fly, I have probably made a mistake in them).

Although they appear in the context-free grammar, numbers are often recognized in the lexical analysis phase. Are they included in the language definition to make it more complete and it is up to the implementer to realize that they should actually be in the scanner?

like image 777
gnuvince Avatar asked Oct 11 '22 05:10

gnuvince


1 Answers

Many common parser generator tools -- such as ANTLR, Lex/YACC -- separate parsing into two phases: first, the input string is tokenized. Second, the tokens are combined into productions to create a concrete syntax tree.

However, there are alternative techniques that do not require tokenization: check out backtracking recursive-descent parsers. For such a parser, tokens are defined in a similar way to non-tokens. pyparsing is a parser generator for such parsers.

The advantage of the two-step technique is that it usually produces more efficient parsers -- with tokens, there's a lot less string manipulation, string searching, and backtracking.

According to "The Definitive ANTLR Reference" (Terence Parr),

The only difference between [lexers and parsers] is that the parser recognizes grammatical structure in a stream of tokens while the lexer recognizes structure in a stream of characters.

like image 53
Matt Fenwick Avatar answered Oct 12 '22 19:10

Matt Fenwick