Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

ANTLR - trouble getting AST hierarchy setup

I am trying to get my head around the tree construction operators (^ and !) in ANTLR.

I have a grammar for flex byte arrays (a UINT16 that describe number of bytes in array, followed by that many bytes). I've commented out all of the semantic predicates and their associated code that does validate that there are as many bytes in the array as indicated by the first two bytes...that part isn't what I'm having problems with.

My issue is with the tree that is generated after parsing some input. All that happens is that each character is a sibling node. I was expecting that the AST generated would be similar to the tree that you can see in Interpreter window of ANTLRWorks 1.4. As soon as I try to change how the tree would be made using a ^ character, I get an exception of the form:

Unhandled Exception: System.SystemException: more than one node as root (TODO: make exception hierarchy)

Here's the grammar (currently targeting C#):

grammar FlexByteArray_HexGrammar;

options 
{
//language = 'Java';
language = 'CSharp2';
output=AST; 
}

expr 
    :   array_length remaining_data
        //the amount of remaining data must be equal to the array_length (times 2 since 2 hex characters per byte)
        // need to check if the array length is zero first to avoid checking $remaining_data.text (null reference) in that situation.
        //{ ($array_length.value == 0 && $remaining_data.text == null) || ($remaining_data.text != null && $array_length.value*2 == $remaining_data.text.Length) }?
    ;

array_length //returns [UInt16 value]
    :   uint16_little //{ $value = $uint16_little.value; }
    ;

hex_byte1 //needed just so I can distinguish between two bytes in a uint16 when doing a semantic predicate (or whatever you call it where I write in the target language in curly brackets)
    :   hex_byte
    ;

uint16_big //returns [UInt16 value]
    :   hex_byte1 hex_byte //{ $value = Convert.ToUInt16($hex_byte.text + $hex_byte1.text); }
    ;

uint16_little //returns [UInt16 value]
    :   hex_byte1 hex_byte //{ $value = Convert.ToUInt16($hex_byte1.text + $hex_byte.text); }
    ;

remaining_data 
    :   hex_byte*
    ;

hex_byte 
    :   HEX_DIGIT HEX_DIGIT
    ;

HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F')
;

Here's kind of what I thought the AST would be:

ANTLRWorks 1.4 interpreter output for input of "0001ff"

Here's the program in C# I was using to get a visual (actually textual, but then I put it thru GraphViz to get picture) representation of the AST:

namespace FlexByteArray_Hex
{
    using System;
    using Antlr.Runtime;
    using Antlr.Runtime.Tree;
    using Antlr.Utility.Tree;

    public class Program
    {
        public static void Main(string[] args)
        {
            ICharStream input = new ANTLRStringStream("0001ff");
            FlexByteArray_HexGrammarLexer lex = new FlexByteArray_HexGrammarLexer(input);
            CommonTokenStream tokens = new CommonTokenStream(lex);
            FlexByteArray_HexGrammarParser parser = new FlexByteArray_HexGrammarParser(tokens);

            Console.WriteLine("Parser created.");

            CommonTree tree = parser.expr().Tree as CommonTree;

            Console.WriteLine("------Input parsed-------");

            if (tree == null)
            {
                Console.WriteLine("Tree is null.");
            }
            else
            {
                DOTTreeGenerator treegen = new DOTTreeGenerator();
                Console.WriteLine(treegen.ToDOT(tree));
            }
        }
    }
}

Here's what the output of that program put into GraphViz looks like: graph viz output

The same program in Java (in case you want to try it out and don't use C#):

import org.antlr.*;
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;

public class Program
{
    public static void main(String[] args) throws Exception
    {
        FlexByteArray_HexGrammarLexer lex = new FlexByteArray_HexGrammarLexer(new ANTLRStringStream("0001ff"));
        CommonTokenStream tokens = new CommonTokenStream(lex);
        FlexByteArray_HexGrammarParser parser = new FlexByteArray_HexGrammarParser(tokens);

        System.out.println("Parser created.");

        CommonTree tree = (CommonTree)parser.expr().tree;

        System.out.println("------Input parsed-------");


        if (tree == null)
        {
            System.out.println("Tree is null.");
        }
        else
        {
            DOTTreeGenerator treegen = new DOTTreeGenerator();
            System.out.println(treegen.toDOT(tree));
        }
    }
}
like image 569
Anssssss Avatar asked Apr 27 '11 21:04

Anssssss


1 Answers

Anssssss wrote:

As soon as I try to change how the tree would be made using a ^ character, I get an exception of the form:

When trying to make the the parser rule a the root of the tree inside p like this:

p : a^ b;
a : A A;
b : B B;

ANTLR does not know which of the A's is the root of rule a. And there can't be two roots, of course.

The inline tree operators are handy in certain cases, but in this case, they're not up to the task. You can't assign a root inside a production rule that can possible have no contents, like your remaining_data rule. In that case you need to create "imaginary tokens" in the tokens { ... } section of your grammar and use rewrite rule (-> ^( ... )) to create your AST.

A demo

The following grammar:

grammar FlexByteArray_HexGrammar;

options {
  output=AST;
}

tokens {
  ROOT;
  ARRAY;
  LENGTH;
  DATA;
}

expr
  :  array* EOF -> ^(ROOT array*)
  ;

array
@init { int index = 0; }
  :  array_length array_data[$array_length.value] -> ^(ARRAY array_length array_data)
  ;

array_length returns [int value]
  :  a=hex_byte b=hex_byte {$value = $a.value*16*16 + $b.value;} -> ^(LENGTH hex_byte hex_byte)
  ;

array_data [int length]
  :  ({length > 0}?=> hex_byte {length--;})* {length == 0}? -> ^(DATA hex_byte*)
  ;

hex_byte returns [int value]
  :  a=HEX_DIGIT b=HEX_DIGIT {$value = Integer.parseInt($a.text+$b.text, 16);}
  ;

HEX_DIGIT
  :  '0'..'9' | 'a'..'f' | 'A'..'F'
  ;

will parse the following input:

0001ff0002abcd

into the following AST:

enter image description here

as you can see using the following main class:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
    public static void main(String[] args) throws Exception {
        FlexByteArray_HexGrammarLexer lexer = new FlexByteArray_HexGrammarLexer(new ANTLRStringStream("0001ff0002abcd"));
        FlexByteArray_HexGrammarParser parser = new FlexByteArray_HexGrammarParser(new CommonTokenStream(lexer));
        CommonTree tree = (CommonTree)parser.expr().getTree();
        DOTTreeGenerator gen = new DOTTreeGenerator();
        StringTemplate st = gen.toDOT(tree);
        System.out.println(st);
    }
}

More info

  • if you want to know more about the predicates I used in the array_data parser rule, see this previous Q&A: What is a 'semantic predicate' in ANTLR?
  • to get more info about the tree rewrite operator -> ^( ... ), see this previous Q&A: How to output the AST built using ANTLR?

EDIT

To give a short explanation of the array_data rule:

array_data [int length]
  :  ({length > 0}?=> hex_byte {length--;})* {length == 0}? -> ^(DATA hex_byte*)
  ;

As you already mentioned in the comments, you can pass one or more parameters to rules by adding [TYPE IDENTIFIER] after the rule.

The first (gated) semantic predicate, {length > 0}?=>, checks if length is more than zero. If this is the case, the parser tries to match a hex_byte after which the length variable is decreased by one. This all stops when length is zero, or when the parser has no more hex_byte to be parsed which happens when the EOF is next in line. Because it could parse less than the mandatory number of hex_bytes, there is a (validating) semantic predicate at the very end of the rule, {length == 0}?, that ensures that the correct amount of hex_bytes have been parsed (no more, and no less!).

Hope that clarifies it a bit more.

like image 186
Bart Kiers Avatar answered Nov 20 '22 13:11

Bart Kiers