Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

ANTLR: How the behavior of this grammar which recognizes suffixes of a Java code can be explained?

A week ago I started the following project: a grammar which recognizes suffixes of a Java code.

I used the official ANTLR grammar for Java (Java.g4) as a baseline and started to add some rules. However, those new rules also introduced left recursion which I also had to deal with.

After a couple of days of work I had the following code. When I started testing I noticed something unusual which I still can't explain. When given the input { } the parser tells me no viable alternative at input '<EOF>' but when I switch the order of the terminals in the right-handed side of the rule s2, particularly if we change the right-handed side from v2_1 | v2_2 | v2_3 ... to v2_36 | v2_1 | v2_2 ... (the terminal v2_36 is moved to the first position), the sequence { } gets accepted.

My first thoughts were that Antlr does not backtrack because I noticed that with the input { } the first version of the parser starts to follow the rule v2_3 and just reports that nothing is found and does not try to consider other options (that's what I think but maybe it's not true) like v2_36 which give exactly the positive answer.

But, after some research, I found out that ANTLR actually does backtrack but only if everything else fails. At least this is true for v3.3 (read it in official ANTLR paper) but I guess it's also true for v4. Now I'm a little bit confused. After spending so many hours on this project I would feel really terrible if I don't make it work. Can someone gives some kind of tip or something? It would be greatly appreciated, thanks.

EDIT

Managed to isolate the problem to

grammar Java;
@parser::members {String ruleName; }

start : compilationUnitSuf EOF;

compilationUnitSuf
    :   {ruleName = "typeDeclarationSuf"; } s2
    ;

s2: '{' '}' v2_81 | '{' '}';
v2_81 : {ruleName.equals("enumBodyDeclarationsSuf")}? t173 | t173 '}';
t173: '}' | '{'*;

LBRACKET: '{';
RBRACKET: '}';

WS  :  [ \t\r\n\u000C]+ -> skip
    ;

So why does the predicting algorithm suggest me to follow s2 -> v'{' '}' v2_81 -> ... instead of s2 -> '{' '}'?

like image 449
sve Avatar asked Aug 28 '13 19:08

sve


People also ask

How do you write an Antlr grammar?

Add the package name that you want to see in the Java file in which the lexer and parser files will be created. Add the Language in which you want the output like Java , Python etc. Tick the generate parser tree listener and generate tree visitor if you want to modify the visitor. Now the configuration is done.

How does Antlr lexer work?

Lexer Rule ActionsAn ANTLR lexer creates a Token object after matching a lexical rule. Each request for a token starts in Lexer. nextToken , which calls emit once it has identified a token. emit collects information from the current state of the lexer to build the token.

What is lexer and parser Antlr?

ANTLR or ANother Tool for Language Recognition is a lexer and parser generator aimed at building and walking parse trees. It makes it effortless to parse nontrivial text inputs such as a programming language syntax.


1 Answers

I think that you will find that it is not backtracking in the manner that you expect. The reason is that it finds the {} and then expects to see a v2_181, which it doesn't find. because it doesn't then backtrack, it doesn't find the alternative that you want. The alternative is to just make the v2_181 optional, then you don't need the backtracking. Something like below:

grammar Java;
@parser::members {String ruleName; }

start : compilationUnitSuf EOF;

compilationUnitSuf
    :   {ruleName = "typeDeclarationSuf"; } s2
    ;

s2: '{' '}' v2_81?;
v2_81 : {ruleName.equals("enumBodyDeclarationsSuf")}? t173 | t173 '}';
t173: '}' | '{'*;

LBRACKET: '{';
RBRACKET: '}';

WS  :  [ \t\r\n\u000C]+ -> skip
    ;
like image 137
Paul Wagland Avatar answered Sep 22 '22 14:09

Paul Wagland