Are lexers and parsers really that different in theory?
It seems fashionable to hate regular expressions (Coding Horror, another blog post).
However, popular lexing-based tools (pygments, geshi, prettify) all use regular expressions, and they seem to be able to lex anything...
When is lexing enough, and when do you need EBNF?
Has anyone used the tokens produced by these lexers with bison or antlr parser generators?
The lexer just turns the meaningless string into a flat list of things like "number literal", "string literal", "identifier", or "operator", and can do things like recognizing reserved identifiers ("keywords") and discarding whitespace. Formally, a lexer recognizes some set of Regular languages.
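As a rough illustration of that description, here is a minimal regex-driven lexer sketch in Python. The token names, patterns, and keyword set are my own assumptions for the example, not taken from any particular tool:

```python
import re

# Minimal regex-driven lexer sketch: each token class is a regular language,
# so a single alternation of named groups is enough to classify lexemes.
TOKEN_SPEC = [
    ("NUMBER",     r"\d+(?:\.\d+)?"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("OPERATOR",   r"==|<=|>=|[+\-*/^<>=]"),
    ("WHITESPACE", r"\s+"),
]
KEYWORDS = {"if", "else", "while", "return"}   # reserved identifiers

MASTER_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(text):
    """Yield (kind, lexeme) pairs, discarding whitespace."""
    for match in MASTER_RE.finditer(text):
        kind, lexeme = match.lastgroup, match.group()
        if kind == "WHITESPACE":
            continue                     # whitespace is thrown away
        if kind == "IDENTIFIER" and lexeme in KEYWORDS:
            kind = "KEYWORD"             # reserved identifiers become keywords
        yield kind, lexeme

print(list(lex("if x <= 42 return x + 3")))
# [('KEYWORD', 'if'), ('IDENTIFIER', 'x'), ('OPERATOR', '<='), ('NUMBER', '42'), ...]
```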
A lexer is the part of an interpreter that turns a sequence of characters (plain text) into a sequence of tokens. A parser, in turn, takes a sequence of tokens and produces an abstract syntax tree (AST) of a language. The rules by which a parser operates are usually specified by a formal grammar.
A parser is a compiler or interpreter component that breaks data into smaller elements for easier translation into another language. It takes input in the form of a sequence of tokens, interactive commands, or program instructions and breaks it up into parts that can be used by other components.
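To make the hand-off concrete, here is a sketch (my own, not from the question or answers) of a hand-written recursive-descent parser that consumes (kind, lexeme) token pairs like the ones a lexer emits and builds a small tuple-based AST. The toy grammar, token names, and AST shape are assumptions for illustration only:

```python
# Recursive-descent parser sketch: tokens in, abstract syntax tree out.
class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else (None, None)

    def eat(self, kind):
        tok_kind, lexeme = self.peek()
        if tok_kind != kind:
            raise SyntaxError(f"expected {kind}, got {tok_kind}")
        self.pos += 1
        return lexeme

    def parse_expr(self):                # expr -> term (('+'|'-') term)*
        node = self.parse_term()
        while self.peek() in [("OPERATOR", "+"), ("OPERATOR", "-")]:
            op = self.eat("OPERATOR")
            node = ("binop", op, node, self.parse_term())
        return node

    def parse_term(self):                # term -> NUMBER | IDENTIFIER
        kind, _ = self.peek()
        if kind == "NUMBER":
            return ("number", self.eat("NUMBER"))
        return ("identifier", self.eat("IDENTIFIER"))

# Token stream for "x + 3 - y", written out literally so the sketch stands alone.
tokens = [("IDENTIFIER", "x"), ("OPERATOR", "+"), ("NUMBER", "3"),
          ("OPERATOR", "-"), ("IDENTIFIER", "y")]
print(Parser(tokens).parse_expr())
# ('binop', '-', ('binop', '+', ('identifier', 'x'), ('number', '3')), ('identifier', 'y'))
```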
What parsers and lexers have in common:
- They both read symbols of some alphabet from their input and try to match them against the grammar they understand. For example, *, ==, <= and ^ will all be classified as an "operator" token by the C/C++ lexer, while [number][operator][number], [id][operator][id] and [id][operator][number][operator][number] will all be classified as an "expression" nonterminal by the C/C++ parser.
- They both emit sentences of the language they recognize. A lexer emits a flat stream of tokens, and a parser can do the same: a parser that recognizes tags embedded in plain text could emit a token series like [TXT][TAG][TAG][TXT][TAG][TXT]...
As you can see, parsers and tokenizers have much in common. One parser can be a tokenizer for another parser, which reads its input tokens as symbols of its own alphabet (tokens are simply symbols of some alphabet), in the same way that sentences of one language can be alphabetic symbols of some other, higher-level language. For example, if . and - are the symbols of the alphabet M (as "Morse code symbols"), then you can build a parser which recognizes strings of these dots and lines as letters encoded in Morse code. The sentences of the "Morse code" language can then be tokens for some other parser, for which they are atomic symbols of its language (e.g. the "English words" language). And these "English words" can in turn be tokens (symbols of the alphabet) for some higher-level parser which understands the "English sentences" language. All these languages differ only in the complexity of their grammar. Nothing more.
So what is all this about "Chomsky's grammar levels"? Noam Chomsky classified grammars into four levels depending on their complexity; the two simplest levels are the ones that matter here.

Regular grammars (level 3) are what regular expressions describe: they can consist only of the symbols of the alphabet (a, b), their concatenations (ab, aba, bbb etc.), or alternatives (e.g. a|b). They cannot handle nested syntax, such as properly matched parentheses (()()(()())), nested HTML/BBcode tags, nested blocks etc., because the finite state automata that recognize them would need infinitely many states to handle infinitely many nesting levels.

Context-free grammars (level 2) can handle that kind of nesting, but they cannot handle context-sensitive syntax: in an expression like x+3, in one context this x could be the name of a variable, and in another context it could be the name of a function, etc.
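A small sketch of that boundary (my own example): a regular expression without recursion can only accept flat repetitions of (), while a counter, standing in for a stack, handles arbitrary nesting.

```python
import re

# The regular pattern below only accepts flat sequences like "()()()";
# it cannot count nesting depth, which is exactly the level-3 limitation.
FLAT = re.compile(r"^(?:\(\))*$")

def balanced(s):
    """Context-free-style check: a depth counter plays the role of a stack."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

print(bool(FLAT.match("(()()(()()))")), balanced("(()()(()()))"))   # False True
```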
Yes, they are very different in theory, and in implementation.
Lexers are used to recognize "words" that make up language elements, because the structure of such words is generally simple. Regular expressions are extremely good at handling this simpler structure, and there are very high-performance regular-expression matching engines used to implement lexers.
Parsers are used to recognize the "structure" of language phrases. Such structure is generally far beyond what "regular expressions" can recognize, so one needs "context-sensitive" parsers to extract such structure. Context-sensitive parsers are hard to build, so the engineering compromise is to use "context-free" grammars and add hacks to the parsers ("symbol tables", etc.) to handle the context-sensitive part.
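The best-known instance of that compromise is the C typedef ambiguity: whether "size_t * y;" is a pointer declaration or a multiplication depends on what size_t currently names, so the parser feeds a symbol table back into how tokens are interpreted. A hypothetical sketch of the idea (the statement format and names are heavily simplified for illustration):

```python
# Sketch of the "symbol table" hack: the grammar stays context-free, and the
# context (which names are currently type names) is consulted separately.
symbol_table = {"size_t"}          # type names the parser has recorded so far

def interpret(stmt):
    """Disambiguate 'A * B;' as a declaration or a multiplication."""
    name, _star, other = stmt.rstrip(";").split()
    if name in symbol_table:
        return ("declaration", other, "pointer to " + name)
    return ("expression", "multiply", name, other)

print(interpret("size_t * y;"))    # ('declaration', 'y', 'pointer to size_t')
print(interpret("x * y;"))         # ('expression', 'multiply', 'x', 'y')
```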
Neither lexing nor parsing technology is likely to go away soon.
They may be unified by deciding to use "parsing" technology to recognize "words", as is currently explored by so-called scannerless GLR parsers. That has a runtime cost, as you are applying more general machinery to what is often a problem that doesn't need it, and usually you pay for that in overhead. Where you have lots of free cycles, that overhead may not matter. If you process a lot of text, then the overhead does matter and classical regular expression parsers will continue to be used.