Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Solving antlr left recursion

I'm trying to parse a language using ANTLR which can contain the following syntax:

someVariable, somVariable.someMember, functionCall(param).someMember,  foo.bar.baz(bjork).buffalo().xyzzy

This is the ANTLR grammar which i've come up with so far, and the access_operation throws the error

The following sets of rules are mutually left-recursive [access_operation, expression]:

grammar Test;

options { 
  output=AST;
  ASTLabelType=CommonTree; 
}

tokens {
  LHS;
  RHS;
  CALL;
  PARAMS;
}

start   
  :  body? EOF
  ;

body
  : expression (',' expression)*
  ;

expression
  : function -> ^(CALL)
  | access_operation
  | atom
  ;

access_operation
  : (expression -> ^(LHS)) '.'! (expression -> ^(RHS))
  ;

function
  : (IDENT '(' body? ')') -> ^(IDENT PARAMS?) 
  ;         

atom
  : IDENT
  | NUMBER
  ;

fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT  : '0'..'9';

IDENT    : (LETTER)+ ;
NUMBER   : (DIGIT)+ ;
SPACE    : (' ' | '\t' | '\r' | '\n') { $channel=HIDDEN; };

What i could manage so far was to refactor the access_operation rule to '.' expression which generates an AST where the access_operation node only contains the right side of the operation.

What i'm looking for instead is something like this:

enter image description here

How can the left-recursion problem solved in this case?

like image 593
pulse00 Avatar asked Jan 02 '12 00:01

pulse00


1 Answers

By "wrong AST" I'll make a semi educated guess that, for input like "foo.bar.baz", you get an AST where foo is the root with bar as a child who in its turn has baz as a child, which is a leaf in the AST. You may want to have this reversed. But I'd not go for such an AST if I were you: I'd keep the AST as flat as possible:

    foo
   / | \
  /  |  \
bar baz  ...

That way, evaluating is far easier: you simply look up foo, and then walk from left to right through its children.

A quick demo:

grammar Test;

options { 
  output=AST;
  ASTLabelType=CommonTree; 
}

tokens {
  BODY;
  ACCESS;
  CALL;
  PARAMS;
}

start   
 : body EOF -> body
 ;

body
 : expression (',' expression)* -> ^(BODY expression+)
 ;

expression
 : atom
 ;         

atom
 : NUMBER
 | (IDENT -> IDENT) ( tail       -> ^(IDENT tail)
                    | call tail? -> ^(CALL IDENT call tail?)
                    )?
 ;

tail
 : (access)+
 ;

access
 : ('.' IDENT -> ^(ACCESS IDENT)) (call -> ^(CALL IDENT call))?
 ;

call
 : '(' (expression (',' expression)*)? ')' -> ^(PARAMS expression*)
 ;

IDENT  : LETTER+;
NUMBER : DIGIT+;
SPACE  : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};

fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT  : '0'..'9';

which can be tested with:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
  public static void main(String[] args) throws Exception {
    String src = "someVariable, somVariable.someMember, functionCall(param).someMember, " + 
        "foo.bar.baz(bjork).buffalo().xyzzy";
    TestLexer lexer = new TestLexer(new ANTLRStringStream(src));
    TestParser parser = new TestParser(new CommonTokenStream(lexer));
    CommonTree tree = (CommonTree)parser.start().getTree();
    DOTTreeGenerator gen = new DOTTreeGenerator();
    StringTemplate st = gen.toDOT(tree);
    System.out.println(st);
  }
}

The output of Main corresponds to the following AST:

enter image description here

EDIT

And since you indicated your ultimate goal is not evaluating the input, but that you rather need to conform the structure of the AST to some 3rd party API, here's a grammar that will create an AST like you indicated in your edited question:

grammar Test;

options { 
  output=AST;
  ASTLabelType=CommonTree; 
}

tokens {
  BODY;
  ACCESS_OP;
  CALL;
  PARAMS;
  LHS;
  RHS;
}

start   
 : body EOF -> body
 ;

body
 : expression (',' expression)* -> ^(BODY expression+)
 ;

expression
 : atom
 ;         

atom
 : NUMBER
 | (ID -> ID) ( ('(' params ')' -> ^(CALL ID params)) 
                ('.' expression -> ^(ACCESS_OP ^(LHS ^(CALL ID params)) ^(RHS expression)))?
              | '.' expression  -> ^(ACCESS_OP ^(LHS ID) ^(RHS expression))
              )?
 ;

params
 : (expression (',' expression)*)? -> ^(PARAMS expression*)
 ;

ID     : LETTER+;
NUMBER : DIGIT+;
SPACE  : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};

fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT  : '0'..'9';

which creates the following AST if you run the Main class:

enter image description here

The atom rule may be a bit daunting, but you can't shorten it much since the left ID needs to be available to most of the alternatives. ANTLRWorks helps in visualizing the alternative paths this rule may take:

enter image description here

which means atom can be any of the 5 following alternatives (with their corresponding AST's):

+----------------------+--------------------------------------------------------+
| alternative          | generated AST                                          |
+----------------------+--------------------------------------------------------+
| NUMBER               | NUMBER                                                 |
| ID                   | ID                                                     |
| ID params            | ^(CALL ID params)                                      |
| ID params expression | ^(ACCESS_OP ^(LHS ^(CALL ID params)) ^(RHS expression))|
| ID expression        | ^(ACCESS_OP ^(LHS ID) ^(RHS expression)                |
+----------------------+--------------------------------------------------------+
like image 141
Bart Kiers Avatar answered Sep 23 '22 02:09

Bart Kiers