ANTLR rule priorities

Tags:

antlr3

Firstly, I know this grammar doesn't make sense, but it was created to test ANTLR's rule priority behaviour:

grammar test;

options 
{

output=AST;
backtrack=true;
memoize=true;

}

rule_list_in_order :
    (
    first_rule
    | second_rule
    | any_left_over_tokens)+
    ;


first_rule
    :
     FIRST_TOKEN
    ;


second_rule:     
    FIRST_TOKEN NEW_LINE SECOND_TOKEN NEW_LINE;


any_left_over_tokens
    :
    NEW_LINE
    | FIRST_TOKEN
    | SECOND_TOKEN;



FIRST_TOKEN
    : 'First token here'
    ;   

SECOND_TOKEN
    : 'Second token here';

NEW_LINE
    : ('\r'?'\n')   ;

WS  : (' '|'\t'|'\u000C')
    {$channel=HIDDEN;}
    ;

When I give this grammar the input 'First token here\nSecond token here', it matches the second_rule.

I would have expected it to match first_rule and then any_left_over_tokens, because first_rule appears before second_rule in rule_list_in_order, which is the start rule. Can anyone explain why this happens?

Cheers


asked Feb 04 '11 15:02

probably at the beach

1 Answer

First of all, ANTLR's lexer tries its token rules from top to bottom, so tokens defined first take precedence over the ones below them. When token rules overlap, the rule that matches the most characters takes precedence (greedy match).

The same principle holds within parser rules: alternatives defined first are tried first. For example, in rule foo, sub-rule a is tried before b:

foo
  :  a
  |  b
  ;

Note that in your case second_rule isn't actually matched: the parser tries to match it and fails because there is no trailing line break, producing the error:

line 0:-1 mismatched input '<EOF>' expecting NEW_LINE

So nothing is matched at all. That is odd: because you've set backtrack=true, the parser should at least backtrack and match:

first_rule ("First token here")
any_left_over_tokens ("line-break")
any_left_over_tokens ("Second token here")

if it doesn't simply match first_rule in the first place, without even trying second_rule.
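To make that expectation concrete, here is a small hand-written Java sketch of "try the alternatives in order, take the first one whose lookahead succeeds, otherwise fall back to a single left-over token". It is only a simplified model of the behaviour I'd expect, not the code ANTLR generates (the real parser decides with a lookahead DFA and may consult predicates only where that DFA is ambiguous). The names OrderedChoiceDemo, Alt, lookahead and parse are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of ordered alternatives; NOT ANTLR-generated code.
class OrderedChoiceDemo {

    // Token types mirroring the lexer rules of the grammar in the question.
    enum Tok { FIRST_TOKEN, NEW_LINE, SECOND_TOKEN }

    // A named alternative that is just a fixed sequence of tokens (hypothetical helper).
    static final class Alt {
        final String name;
        final Tok[] seq;
        Alt(String name, Tok... seq) { this.name = name; this.seq = seq; }
    }

    // The two orderings discussed in this answer.
    static final List<Alt> FIRST_THEN_SECOND = Arrays.asList(
        new Alt("first_rule", Tok.FIRST_TOKEN),
        new Alt("second_rule", Tok.FIRST_TOKEN, Tok.NEW_LINE, Tok.SECOND_TOKEN, Tok.NEW_LINE));
    static final List<Alt> SECOND_THEN_FIRST = Arrays.asList(
        FIRST_THEN_SECOND.get(1), FIRST_THEN_SECOND.get(0));

    // Speculative match: true only if the whole sequence is present at pos.
    static boolean lookahead(List<Tok> input, int pos, Tok... seq) {
        if (pos + seq.length > input.size()) return false;
        for (int i = 0; i < seq.length; i++) {
            if (input.get(pos + i) != seq[i]) return false;
        }
        return true;
    }

    // Repeatedly picks the first alternative whose lookahead succeeds;
    // if none does, consumes one token as any_left_over_tokens.
    static List<String> parse(List<Tok> input, List<Alt> alts) {
        List<String> trace = new ArrayList<>();
        int pos = 0;
        while (pos < input.size()) {
            Alt chosen = null;
            for (Alt alt : alts) {
                if (lookahead(input, pos, alt.seq)) { chosen = alt; break; }
            }
            if (chosen == null) {
                trace.add("any_left_over_tokens");
                pos += 1;
            } else {
                trace.add(chosen.name);
                pos += chosen.seq.length;
            }
        }
        return trace;
    }

    public static void main(String[] args) {
        // Token stream for "First token here\nSecond token here" (no trailing line break).
        List<Tok> input = Arrays.asList(Tok.FIRST_TOKEN, Tok.NEW_LINE, Tok.SECOND_TOKEN);
        // Both lines print [first_rule, any_left_over_tokens, any_left_over_tokens]
        System.out.println(parse(input, FIRST_THEN_SECOND));
        System.out.println(parse(input, SECOND_THEN_FIRST));
    }
}
```

In this model the speculative match of second_rule fails on your input because the final NEW_LINE is missing, so the parser falls through to first_rule whichever of the two is listed first. The order only starts to matter (in the model) once the input ends with a line break: then second_rule listed first consumes all four tokens, while first_rule listed first still wins on the first token.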

A quick demo that adds the predicates manually (and disables backtrack in the options { ... } section) would look like:

grammar T;

options {
  output=AST;
  //backtrack=true;
  memoize=true;
}

rule_list_in_order
  :  ( (first_rule)=>  first_rule  {System.out.println("first_rule=[" + $first_rule.text + "]");}
     | (second_rule)=> second_rule {System.out.println("second_rule=[" + $second_rule.text + "]");}
     | any_left_over_tokens        {System.out.println("any_left_over_tokens=[" + $any_left_over_tokens.text + "]");}
     )+ 
  ;

first_rule
  :  FIRST_TOKEN
  ;

second_rule
  :  FIRST_TOKEN NEW_LINE SECOND_TOKEN NEW_LINE
  ;

any_left_over_tokens
  :  NEW_LINE
  |  FIRST_TOKEN
  |  SECOND_TOKEN
  ;

FIRST_TOKEN  : 'First token here';   
SECOND_TOKEN : 'Second token here';
NEW_LINE     : ('\r'?'\n');
WS           : (' '|'\t'|'\u000C') {$channel=HIDDEN;};

which can be tested with the class:

import org.antlr.runtime.*;

public class Main {
    public static void main(String[] args) throws Exception {
        String source = "First token here\nSecond token here";
        ANTLRStringStream in = new ANTLRStringStream(source);
        TLexer lexer = new TLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TParser parser = new TParser(tokens);
        parser.rule_list_in_order();
    }
}

which produces the expected output:

first_rule=[First token here]
any_left_over_tokens=[
]
any_left_over_tokens=[Second token here]

Note that it doesn't matter which of the following two orderings you use:

rule_list_in_order
  :  ( (first_rule)=>  first_rule 
     | (second_rule)=> second_rule
     | any_left_over_tokens
     )+ 
  ;

rule_list_in_order
  :  ( (second_rule)=> second_rule // <--+--- swapped
     | (first_rule)=>  first_rule  // <-/
     | any_left_over_tokens
     )+ 
  ;

Both produce the expected output.

So, my guess is that you may have found a bug.

You could try the ANTLR mailing list if you want a definitive answer (Terence Parr is more active there than here).

Good luck!

PS. I tested this with ANTLR v3.2

answered Oct 25 '22 16:10

Bart Kiers

