How does the ANTLR lexer disambiguate its rules (or why does my parser produce "mismatched input" errors)?

Tags: parsing, antlr, lexer, antlr4

Note: This is a self-answered question that aims to provide a reference about one of the most common mistakes made by ANTLR users.

When I test this very simple grammar:

grammar KeyValues;

keyValueList: keyValue*;
keyValue: key=IDENTIFIER '=' value=INTEGER ';';

IDENTIFIER: [A-Za-z0-9]+;
INTEGER: [0-9]+;

WS: [ \t\r\n]+ -> skip;

With the following input:

foo = 42;

I end up with the following run-time error:

line 1:6 mismatched input '42' expecting INTEGER
line 1:8 mismatched input ';' expecting '='

Why doesn't ANTLR recognize 42 as an INTEGER in this case?
It should match the pattern [0-9]+ just fine.

If I invert the order in which INTEGER and IDENTIFIER are defined it seems to work, but why does the order matter in the first place?

asked Sep 17 '17 19:09

Lucas Trzesniewski

1 Answer

In ANTLR, the lexer is isolated from the parser, which means it will split the text into typed tokens according to the lexer grammar rules, and the parser has no influence on this process (it cannot say "give me an INTEGER now" for instance). It produces a token stream by itself. Furthermore, the parser doesn't care about the token text; it only uses the token types to match its rules.

This may easily become a problem when several lexer rules can match the same input text. In that case, the token type will be chosen according to these precedence rules:

1. First, select the lexer rules which match the longest input substring.
2. If the longest matched substring is equal to an implicitly defined token (like '='), use the implicit rule as the token type.
3. If several lexer rules match the same input, choose the first one, based on definition order.

These rules are very important to keep in mind in order to use ANTLR effectively.

In the example from the question, the parser expects to see the following token stream to match the keyValue parser rule: IDENTIFIER '=' INTEGER ';' where '=' and ';' are implicit token types.

Since 42 matches both INTEGER and IDENTIFIER with the same length (two characters), the tie is broken by definition order, and IDENTIFIER is defined first. The parser therefore receives the following input: IDENTIFIER '=' IDENTIFIER ';' which it won't be able to match to the keyValue rule. Remember, the parser cannot communicate with the lexer; it can only receive tokens from it, so it cannot say "try to match INTEGER next".

It's advisable to minimize the overlap between lexer rules to limit the impact of this effect. In the above example, we have several options:

- Redefine IDENTIFIER as [A-Za-z] [A-Za-z0-9]* (require it to start with a letter). This avoids the problem entirely but prevents identifiers from starting with a digit, so it changes the intent of the grammar.
- Reorder INTEGER and IDENTIFIER. This solves the problem in most cases but prevents fully numeric identifiers from being defined, so it also changes the intent of the grammar in a subtle, non-obvious way.
- Make the parser accept both token types when lexer rules overlap. First, swap INTEGER and IDENTIFIER to give priority to INTEGER. Then define a parser rule id: IDENTIFIER | INTEGER; and use that rule instead of IDENTIFIER in other parser rules, which changes keyValue to key=id '=' value=INTEGER ';'.
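
As a sketch, the last option could look like the grammar below. It only applies the changes described above to the grammar from the question (swapped lexer rules, a new id parser rule, and keyValue using id for the key); everything else is left as it was.

```antlr
grammar KeyValues;

keyValueList: keyValue*;
keyValue: key=id '=' value=INTEGER ';';

// An identifier may now be either token type
id: IDENTIFIER | INTEGER;

// INTEGER comes first, so it wins when both rules match the same text
INTEGER: [0-9]+;
IDENTIFIER: [A-Za-z0-9]+;

WS: [ \t\r\n]+ -> skip;
```

With this version, foo = 42; is tokenized as IDENTIFIER '=' INTEGER ';' and 42 = 42; as INTEGER '=' INTEGER ';'. Both match keyValue, because id accepts either token type for the key.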

Here's a second lexer behavior example to sum up:

The following combined grammar:

grammar LexerPriorityRulesExample;

// Parser rules

randomParserRule: 'foo'; // Implicitly declared token type

// Lexer rules

BAR: 'bar';
IDENTIFIER: [A-Za-z]+;
BAZ: 'baz';

WS: [ \t\r\n]+ -> skip;

Given the following input:

aaa foo bar baz barz

Will produce the following token sequence from the lexer:

IDENTIFIER 'foo' BAR IDENTIFIER IDENTIFIER EOF

- aaa is of type IDENTIFIER
  Only the IDENTIFIER rule can match this token, so there is no ambiguity.

- foo is of type 'foo'
  The parser rule randomParserRule introduces the implicit 'foo' token type, which takes priority over the IDENTIFIER rule.

- bar is of type BAR
  This text matches both the BAR and IDENTIFIER rules. BAR is defined before IDENTIFIER and therefore has precedence.

- baz is of type IDENTIFIER
  This text matches the BAZ rule, but it also matches the IDENTIFIER rule. The latter is chosen as it is defined before BAZ. Given the grammar, BAZ will never be able to match, as the IDENTIFIER rule already covers everything BAZ can match.

- barz is of type IDENTIFIER
  The BAR rule can match the first 3 characters of this string (bar), but the IDENTIFIER rule will match 4 characters. As IDENTIFIER matches a longer substring, it is chosen over BAR.

- EOF (end of file) is an implicitly defined token type which always occurs at the end of the input.

As a rule of thumb, specific rules should be defined before more generic rules. If a rule can only match input which is already covered by a previously defined rule, it will never be used.

Implicitly defined rules such as 'foo' act as if they were defined before all other lexer rules. As they add complexity, it's advisable to avoid them altogether and declare explicit lexer rules instead. Just having a list of tokens in one place instead of having them scattered across the grammar is a compelling advantage of this approach.
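
To illustrate both pieces of advice, the second example could be rewritten along these lines. This is only a sketch: FOO is a hypothetical name chosen here for the explicit token that replaces the implicit 'foo', and it does not appear in the original grammar.

```antlr
grammar LexerPriorityRulesExample;

// Parser rules

randomParserRule: FOO; // Refers to an explicit token instead of 'foo'

// Lexer rules: specific keywords first, generic rules last

FOO: 'foo';
BAR: 'bar';
BAZ: 'baz';
IDENTIFIER: [A-Za-z]+;

WS: [ \t\r\n]+ -> skip;
```

Given the same input as before (aaa foo bar baz barz), the lexer would now produce IDENTIFIER FOO BAR BAZ IDENTIFIER EOF. BAZ can match because it is defined before IDENTIFIER, while barz is still an IDENTIFIER thanks to the longest-match rule.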

answered Sep 30 '22 03:09

Lucas Trzesniewski
