Note: This is a self-answered question that aims to provide a reference about one of the most common mistakes made by ANTLR users.
When I test this very simple grammar:
grammar KeyValues;
keyValueList: keyValue*;
keyValue: key=IDENTIFIER '=' value=INTEGER ';';
IDENTIFIER: [A-Za-z0-9]+;
INTEGER: [0-9]+;
WS: [ \t\r\n]+ -> skip;
With the following input:
foo = 42;
I end up with the following run-time error:
line 1:6 mismatched input '42' expecting INTEGER
line 1:8 mismatched input ';' expecting '='
Why doesn't ANTLR recognize 42
as an INTEGER
in this case?
It should match the pattern [0-9]+
just fine.
If I invert the order in which INTEGER
and IDENTIFIER
are defined it seems to work, but why does the order matter in the first place?
An ANTLR lexer creates a Token object after matching a lexical rule. Each request for a token starts in Lexer. nextToken , which calls emit once it has identified a token. emit collects information from the current state of the lexer to build the token.
ANTLR is a powerful parser generator that you can use to read, process, execute, or translate structured text or binary files. It's widely used in academia and industry to build all sorts of languages, tools, and frameworks. Twitter search uses ANTLR for query parsing, with over 2 billion queries a day.
A language is specified using a context-free grammar expressed using Extended Backus–Naur Form (EBNF). ANTLR can generate lexers, parsers, tree parsers, and combined lexer-parsers.
Fragments are reusable parts of lexer rules which cannot match on their own - they need to be referenced from a lexer rule.
In ANTLR, the lexer is isolated from the parser, which means it will split the text into typed tokens according to the lexer grammar rules, and the parser has no influence on this process (it cannot say "give me an INTEGER
now" for instance). It produces a token stream by itself. Furthermore, the parser doesn't care about the token text, it only cares about the token types to match its rules.
This may easily become a problem when several lexer rules can match the same input text. In that case, the token type will be chosen according to these precedence rules:
'='
), use the implicit rule as the token typeThese rules are very important to keep in mind in order to use ANTLR effectively.
In the example from the question, the parser expects to see the following token stream to match the keyValue
parser rule: IDENTIFIER
'='
INTEGER
';'
where '='
and ';'
are implicit token types.
Since 42
can match both INTEGER
and IDENTIFIER
, and IDENTIFIER
is defined first, the parser will receive the following input: IDENTIFIER
'='
IDENTIFIER
';'
which it won't be able to match to the keyValue
rule. Remember, the parser cannot communicate to the lexer, it can only receive data from it, therefore it cannot say "try to match INTEGER
next".
It's advisable to minimize the lexer rules overlap to limit the impact of this effect. In the above example, we have several options:
IDENTIFIER
as [A-Za-z] [A-Za-z0-9]*
(require it to start with a letter). This avoids the problem entirely but prevents identifier names starting with a number from being defined, so it changes the intent of the grammar.INTEGER
and IDENTIFIER
. This solves the problem for most cases, but prevents fully numeric identifiers from being defined, therefore it also changes the intent of the grammar in a subtle, not so obvious way.INTEGER
and IDENTIFIER
in order to give priority to INTEGER
. Then, define a parser rule id: IDENTIFIER | INTEGER;
then use that rule instead of IDENTIFIER
in other parser rules, which would change keyValue
to key=id '=' value=INTEGER ';'
.Here's a second lexer behavior example to sum up:
The following combined grammar:
grammar LexerPriorityRulesExample;
// Parser rules
randomParserRule: 'foo'; // Implicitly declared token type
// Lexer rules
BAR: 'bar';
IDENTIFIER: [A-Za-z]+;
BAZ: 'baz';
WS: [ \t\r\n]+ -> skip;
Given the following input:
aaa foo bar baz barz
Will produce the following token sequence from the lexer:
IDENTIFIER
'foo'
BAR
IDENTIFIER
IDENTIFIER
EOF
aaa
is of type IDENTIFIER
Only the IDENTIFIER
rule can match this token, there is no ambiguity.
foo
is of type 'foo'
The parser rule randomParserRule
introduces the implicit 'foo'
token type, which is prioritary over the IDENTIFIER
rule.
bar
is of type BAR
This text matches the BAR
rule, which is defined before the IDENTIFIER
rule, and therefore has precedence.
baz
is of type IDENTIFIER
This text matches the BAZ
rule, but it also matches the IDENTIFIER
rule. The latter is chosen as it is defined before BAR
.
Given the grammar, BAZ
will never be able to match, as the IDENTIFIER
rule already covers everything BAZ
can match.
barz
is of type IDENTIFIER
The BAR
rule can match the first 3 characters of this string (bar
), but the IDENTIFIER
rule will match 4 characters. As IDENTIFIER
matches a longer substring, it is chosen over BAR
.
EOF
(end of file) is an implicitly defined token type which always occurs at the end of the input.
As a rule of thumb, specific rules should de defined before more generic rules. If a rule can only match an input which is already covered by a previously defined rule, it will never be used.
Implicitly defined rules such as 'foo'
act as if they were defined before all other lexer rules. As they add complexity, it's advisable to avoid them altogether and declare explicit lexer rules instead. Just having a list of tokens in one place instead of having them scattered across the grammar is a compelling advantage of this approach.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With