Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Understanding the context data structure in Antlr4

I'm trying to write a code translator in Java with the help of Antlr4 and had great success with the grammar part so far. However I'm now banging my head against a wall wrapping my mind around the parse tree data structure that I need to work on after my input has been parsed.

I'm trying to use the visitor template to go over my parse tree. I'll show you an example to illustrate the points of my confusion.

My grammar:

grammar pqlc;

// Lexer

//Schlüsselwörter
EXISTS: 'exists';
REDUCE: 'reduce';
QUERY: 'query';
INT: 'int';
DOUBLE: 'double';
CONST: 'const';
STDVECTOR: 'std::vector';
STDMAP: 'std::map';
STDSET: 'std::set';
C_EXPR: 'c_expr';

INTEGER_LITERAL  : (DIGIT)+ ;
fragment DIGIT: '0'..'9';
DOUBLE_LITERAL : DIGIT '.' DIGIT+;

LPAREN          : '(';
RPAREN          : ')';
LBRACK          : '[';
RBRACK          : ']';
DOT             : '.';
EQUAL           : '==';
LE              : '<=';
GE              : '>=';
GT              : '>';
LT              : '<';
ADD             : '+';
MUL             : '*';
AND             : '&&';
COLON           : ':';

IDENTIFIER    :   JavaLetter JavaLetterOrDigit*;
fragment JavaLetter    :   [a-zA-Z$_]; // these are the "java letters" below 0xFF
fragment JavaLetterOrDigit    :   [a-zA-Z0-9$_]; // these are the "java letters or digits" below 0xFF
WS  
    :  [ \t\r\n\u000C]+ -> skip  
    ;
COMMENT
    :   '/*' .*? '*/' -> skip
    ;

LINE_COMMENT
    :   '//' ~[\r\n]* -> skip
    ;


// Parser

//start_rule: query;

query :
      quant_expr
      | qexpr+
      | IDENTIFIER // order IDENTIFIER and qexpr+?
      | numeral
      | c_expr //TODO

      ;

c_type : INT | DOUBLE | CONST;
bin_op: AND | ADD | MUL | EQUAL | LT | GT | LE| GE;


qexpr:
         LPAREN query RPAREN bin_op_query? 
         // query bin_op query
         | IDENTIFIER  bin_op_query? // copied from query to resolve left recursion problem
         | numeral bin_op_query?  // ^
         | quant_expr bin_op_query? // ^
           |c_expr bin_op_query?
           // query.find(query)
         | IDENTIFIER  find_query? // copied from query to resolve left recursion problem
         | numeral find_query?  // ^
         | quant_expr find_query?
           |c_expr find_query?
           // query[query]
          | IDENTIFIER  array_query? // copied from query to resolve left recursion problem
         | numeral array_query?  // ^
         | quant_expr array_query?
           |c_expr array_query?

     // | qexpr bin_op_query // bad, resolved by quexpr+ in query 
     ;

bin_op_query: bin_op query bin_op_query?; // resolve left recursion of query bin_op query

find_query: '.''find' LPAREN query RPAREN;
array_query: LBRACK query RBRACK;

quant_expr:
    quant id ':' query
          | QUERY LPAREN match RPAREN ':' query
          | REDUCE LPAREN IDENTIFIER RPAREN id ':' query
    ;

match:
         STDVECTOR LBRACK id RBRACK EQUAL cm
     | STDMAP '.''find' LPAREN cm RPAREN EQUAL cm
     | STDSET '.''find' LPAREN cm RPAREN
     ;

cm:
    IDENTIFIER
  | numeral
   | c_expr //TODO
  ;

quant :
          EXISTS;

id :
     c_type IDENTIFIER
     | IDENTIFIER // Nach Seite 2 aber nicht der Übersicht. Laut übersicht id -> aber dann wäre Regel 1 ohne +
   ;

numeral :
            INTEGER_LITERAL
        | DOUBLE_LITERAL
        ;
c_expr:
          C_EXPR
      ;

Now let's parse the following string:

double x: x >= c_expr

Visually I'll get this tree: tree

Let's say my visitor is in the visitQexpr(@NotNull pqlcParser.QexprContext ctx) routine when it hits the branch Qexpr(x bin_op_query).

My question is, how can I tell that the left children ("x" in the tree) is a terminal node, or more specifically an "IDENTIFIER"? There are no visiting rules for Terminal nodes since they aren't rules. ctx.getChild(0) has no RuleIndex. I guess I could use that to check if I'm in a terminal or not, but that still wouldn't tell me if I was in IDENTIFIER or another kind of terminal token. I need to be able to tell the difference somehow.

I had more questions but in the time it took me to write the explanation I forgot them :< Thanks in advance.

like image 248
user2323596 Avatar asked Jun 12 '14 13:06

user2323596


People also ask

What is a context in ANTLR?

A rule context is a record of a single rule invocation. We form a stack of these context objects using the parent pointer. A parent pointer of null indicates that the current context is the bottom of the stack. The ParserRuleContext subclass as a children list so that we can turn this data structure into a tree.

What is ANTLR4 used for?

ANTLR 4 allows you to define lexer and parser rules in a single combined grammar file. This makes it really easy to get started. To get familiar with working with ANTLR, let's take a look at what a simple JSON grammar would look like and break it down. grammar Json; @header { package com.


1 Answers

You can add labels to tokens and access them/check if they exist in the surrounding context:

id :
     c_type labelA = IDENTIFIER
     | labelB = IDENTIFIER 
   ;

You could also do this to create different visits:

id :
     c_type IDENTIFIER    #idType1 //choose more appropriate names!
     | IDENTIFIER         #idType2
   ;

This will create different visitors for the two alternatives and I suppose (i.e. have not verified) that the visitor for id will not be called.

I prefer the following approach though:

id :
        typeDef
     |  otherId
     ;
typeDef: c_type IDENTIFIER;
otherId : IDENTIFIER ;

This is a more heavily typed system. But you can very specifically visit nodes. Some rules of thumb I use:

  1. Use | only when all alternatives are parser rules.
  2. Wrap each Token in a parser rule (like otherId) to give them "more meaning".
  3. It's ok to mix parser rules and tokens, if the tokens are not really important (like ;) and therefore not needed in the parse tree.
like image 154
Onur Avatar answered Sep 28 '22 19:09

Onur