Are lexers and parsers really that different in theory?
It seems fashionable to hate regular expressions (Coding Horror, another blog post).
However, popular lexing-based tools (pygments, geshi, prettify) all use regular expressions, and they seem to be able to lex anything...
When is lexing enough, and when do you need EBNF?
Has anyone used the tokens produced by these lexers with bison or antlr parser generators?
The lexer just turns the meaningless string into a flat list of things like "number literal", "string literal", "identifier", or "operator", and can do things like recognizing reserved identifiers ("keywords") and discarding whitespace. Formally, a lexer recognizes some set of Regular languages.
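As a rough illustration of that description, here is a minimal regex-driven lexer sketch in Python. The token names, patterns, and keyword set are my own assumptions for the example, not taken from any particular tool:

```python
import re

# Minimal regex-driven lexer sketch: each token class is a regular language,
# so a single alternation of named groups is enough to classify lexemes.
TOKEN_SPEC = [
    ("NUMBER",     r"\d+(?:\.\d+)?"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("OPERATOR",   r"==|<=|>=|[+\-*/^<>=]"),
    ("WHITESPACE", r"\s+"),
]
KEYWORDS = {"if", "else", "while", "return"}   # reserved identifiers

MASTER_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(text):
    """Yield (kind, lexeme) pairs, discarding whitespace."""
    for match in MASTER_RE.finditer(text):
        kind, lexeme = match.lastgroup, match.group()
        if kind == "WHITESPACE":
            continue                     # whitespace is thrown away
        if kind == "IDENTIFIER" and lexeme in KEYWORDS:
            kind = "KEYWORD"             # reserved identifiers become keywords
        yield kind, lexeme

print(list(lex("if x <= 42 return x + 3")))
# [('KEYWORD', 'if'), ('IDENTIFIER', 'x'), ('OPERATOR', '<='), ('NUMBER', '42'), ...]
```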
A lexer is the part of an interpreter that turns a sequence of characters (plain text) into a sequence of tokens. A parser, in turn, takes a sequence of tokens and produces an abstract syntax tree (AST) of a language. The rules by which a parser operates are usually specified by a formal grammar.
A parser is a compiler or interpreter component that breaks data into smaller elements for easier translation into another language. It takes input in the form of a sequence of tokens, interactive commands, or program instructions and breaks it up into parts that can be used by other components.
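To make the hand-off concrete, here is a sketch (my own, not from the question or answers) of a hand-written recursive-descent parser that consumes (kind, lexeme) token pairs like the ones a lexer emits and builds a small tuple-based AST. The toy grammar, token names, and AST shape are assumptions for illustration only:

```python
# Recursive-descent parser sketch: tokens in, abstract syntax tree out.
class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else (None, None)

    def eat(self, kind):
        tok_kind, lexeme = self.peek()
        if tok_kind != kind:
            raise SyntaxError(f"expected {kind}, got {tok_kind}")
        self.pos += 1
        return lexeme

    def parse_expr(self):                # expr -> term (('+'|'-') term)*
        node = self.parse_term()
        while self.peek() in [("OPERATOR", "+"), ("OPERATOR", "-")]:
            op = self.eat("OPERATOR")
            node = ("binop", op, node, self.parse_term())
        return node

    def parse_term(self):                # term -> NUMBER | IDENTIFIER
        kind, _ = self.peek()
        if kind == "NUMBER":
            return ("number", self.eat("NUMBER"))
        return ("identifier", self.eat("IDENTIFIER"))

# Token stream for "x + 3 - y", written out literally so the sketch stands alone.
tokens = [("IDENTIFIER", "x"), ("OPERATOR", "+"), ("NUMBER", "3"),
          ("OPERATOR", "-"), ("IDENTIFIER", "y")]
print(Parser(tokens).parse_expr())
# ('binop', '-', ('binop', '+', ('identifier', 'x'), ('number', '3')), ('identifier', 'y'))
```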
What parsers and lexers have in common:
- They both read symbols of some alphabet from their input and try to match them against the grammar they understand. For example, *, ==, <= and ^ will all be classified as an "operator" token by the C/C++ lexer, while [number][operator][number], [id][operator][id] and [id][operator][number][operator][number] will all be classified as an "expression" nonterminal by the C/C++ parser.
- They both emit sentences of the language they recognize. A lexer emits a flat stream of tokens, and a parser can do the same: a parser that recognizes tags embedded in plain text could emit a token series like [TXT][TAG][TAG][TXT][TAG][TXT]...
As you can see, parsers and tokenizers have much in common. One parser can be a tokenizer for another parser, which reads its input tokens as symbols of its own alphabet (tokens are simply symbols of some alphabet), in the same way that sentences of one language can be alphabetic symbols of some other, higher-level language. For example, if . and - are the symbols of the alphabet M (as "Morse code symbols"), then you can build a parser which recognizes strings of these dots and lines as letters encoded in Morse code. The sentences of the "Morse code" language can then be tokens for some other parser, for which they are atomic symbols of its language (e.g. the "English words" language). And these "English words" can in turn be tokens (symbols of the alphabet) for some higher-level parser which understands the "English sentences" language. All these languages differ only in the complexity of their grammar. Nothing more.
So what is all this about "Chomsky's grammar levels"? Noam Chomsky classified grammars into four levels depending on their complexity; the two simplest levels are the ones that matter here.

Regular grammars (level 3) are what regular expressions describe: they can consist only of the symbols of the alphabet (a, b), their concatenations (ab, aba, bbb etc.), or alternatives (e.g. a|b). They cannot handle nested syntax, such as properly matched parentheses (()()(()())), nested HTML/BBcode tags, nested blocks etc., because the finite state automata that recognize them would need infinitely many states to handle infinitely many nesting levels.

Context-free grammars (level 2) can handle that kind of nesting, but they cannot handle context-sensitive syntax: in an expression like x+3, in one context this x could be the name of a variable, and in another context it could be the name of a function, etc.
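A small sketch of that boundary (my own example): a regular expression without recursion can only accept flat repetitions of (), while a counter, standing in for a stack, handles arbitrary nesting.

```python
import re

# The regular pattern below only accepts flat sequences like "()()()";
# it cannot count nesting depth, which is exactly the level-3 limitation.
FLAT = re.compile(r"^(?:\(\))*$")

def balanced(s):
    """Context-free-style check: a depth counter plays the role of a stack."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

print(bool(FLAT.match("(()()(()()))")), balanced("(()()(()()))"))   # False True
```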
Yes, they are very different in theory, and in implementation.
Lexers are used to recognize "words" that make up language elements, because the structure of such words is generally simple. Regular expressions are extremely good at handling this simpler structure, and there are very high-performance regular-expression matching engines used to implement lexers.
Parsers are used to recognize the "structure" of language phrases. Such structure is generally far beyond what "regular expressions" can recognize, so one needs "context-sensitive" parsers to extract such structure. Context-sensitive parsers are hard to build, so the engineering compromise is to use "context-free" grammars and add hacks to the parsers ("symbol tables", etc.) to handle the context-sensitive part.
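The best-known instance of that compromise is the C typedef ambiguity: whether "size_t * y;" is a pointer declaration or a multiplication depends on what size_t currently names, so the parser feeds a symbol table back into how tokens are interpreted. A hypothetical sketch of the idea (the statement format and names are heavily simplified for illustration):

```python
# Sketch of the "symbol table" hack: the grammar stays context-free, and the
# context (which names are currently type names) is consulted separately.
symbol_table = {"size_t"}          # type names the parser has recorded so far

def interpret(stmt):
    """Disambiguate 'A * B;' as a declaration or a multiplication."""
    name, _star, other = stmt.rstrip(";").split()
    if name in symbol_table:
        return ("declaration", other, "pointer to " + name)
    return ("expression", "multiply", name, other)

print(interpret("size_t * y;"))    # ('declaration', 'y', 'pointer to size_t')
print(interpret("x * y;"))         # ('expression', 'multiply', 'x', 'y')
```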
Neither lexing nor parsing technology is likely to go away soon.
They may be unified by deciding to use "parsing" technology to recognize "words", as is currently explored by so-called scannerless GLR parsers. That has a runtime cost, as you are applying more general machinery to what is often a problem that doesn't need it, and usually you pay for that in overhead. Where you have lots of free cycles, that overhead may not matter. If you process a lot of text, then the overhead does matter and classical regular expression parsers will continue to be used.