Does the recognition of numbers belong in the scanner or in the parser?

Question

When you look at the EBNF description of a language, you often see a definition for integers and real numbers:

integer  ::= digit digit*   // Accepts numbers with a 0 prefix
real     ::= integer "." integer (('e'|'E') integer)?

(I wrote these definitions on the fly, so they probably contain a mistake.)

Although they appear in the context-free grammar, numbers are often recognized in the lexical analysis phase. Are they included in the language definition only to make it more complete, with the implementer expected to realize that they actually belong in the scanner?

Matt Fenwick · Accepted Answer

Many common parser generator tools, such as ANTLR and Lex/YACC, separate parsing into two phases. First, the input string is tokenized. Second, the tokens are matched against the grammar's productions to build a concrete syntax tree.
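To make the two phases concrete, here is a minimal hand-written sketch in Python. It is not what ANTLR or Lex/YACC generate. The token rules follow the question's EBNF, while the `sum` rule, the `PLUS` token, and the function names `tokenize` and `parse_sum` are assumptions made up for illustration.

```python
import re

# Token rules mirror the question's EBNF. REAL is tried before INTEGER so
# that "3.14" is not scanned as INTEGER "3" followed by an unexpected ".".
TOKEN_RE = re.compile(r"""
    (?P<REAL>\d+\.\d+(?:[eE]\d+)?)
  | (?P<INTEGER>\d+)
  | (?P<PLUS>\+)
  | (?P<WS>\s+)
""", re.VERBOSE)


def tokenize(text):
    """Phase 1: turn a character string into a list of (kind, lexeme) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if match.lastgroup != "WS":  # whitespace is discarded by the scanner
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def parse_sum(tokens):
    """Phase 2: parse the hypothetical rule  sum ::= number (PLUS number)*."""
    def number(i):
        # The parser only inspects token kinds; it never looks at characters.
        if i < len(tokens) and tokens[i][0] in ("INTEGER", "REAL"):
            return ("num", tokens[i][1]), i + 1
        raise SyntaxError(f"expected a number at token {i}")

    tree, i = number(0)
    while i < len(tokens) and tokens[i][0] == "PLUS":
        right, i = number(i + 1)
        tree = ("add", tree, right)
    if i != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[i]!r}")
    return tree
```

Calling `parse_sum(tokenize("12 + 3.5e2"))` runs both phases in sequence. By the time the parser sees a number, it is already a single token, which is why the number rules can live in the scanner even though the language definition lists them next to the other productions.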

However, there are alternative techniques that do not require a separate tokenization step: check out backtracking recursive-descent parsers. In such a parser, tokens are defined in much the same way as non-tokens. pyparsing is a Python library for building parsers of this kind.
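As a rough illustration of the scannerless approach, the sketch below writes the question's `integer` and `real` rules as ordinary recursive-descent functions that work directly on characters. It is plain Python and does not use pyparsing's API. The `number` rule and all function names are assumptions added for the example.

```python
def parse_integer(text, pos):
    """integer ::= digit digit*   Returns the end position, or None on failure."""
    end = pos
    while end < len(text) and text[end] in "0123456789":
        end += 1
    return end if end > pos else None


def parse_real(text, pos):
    """real ::= integer "." integer (('e'|'E') integer)?"""
    end = parse_integer(text, pos)
    if end is None or end >= len(text) or text[end] != ".":
        return None
    end = parse_integer(text, end + 1)
    if end is None:
        return None
    # Optional exponent: if it does not match completely, keep the
    # position reached before it (the exponent is simply not consumed).
    if end < len(text) and text[end] in "eE":
        exp_end = parse_integer(text, end + 1)
        if exp_end is not None:
            end = exp_end
    return end


def parse_number(text, pos=0):
    """Hypothetical rule  number ::= real | integer  (ordered choice)."""
    # Backtracking: every alternative restarts from the same position.
    for kind, rule in (("real", parse_real), ("integer", parse_integer)):
        end = rule(text, pos)
        if end is not None:
            return kind, text[pos:end], end
    return None
```

Here `parse_number("42")` first tries `real`, fails at the missing ".", and then rescans the same digits as an `integer`. That repeated character-level work is the kind of cost a separate scanner avoids.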

The advantage of the two-step technique is that it usually produces more efficient parsers: with tokens, there is much less string manipulation, string searching, and backtracking.

According to "The Definitive ANTLR Reference" (Terence Parr),

The only difference between [lexers and parsers] is that the parser recognizes grammatical structure in a stream of tokens while the lexer recognizes structure in a stream of characters.

Tags: parsing, lexical-analysis

gnuvince
