Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

ANTLR parses greedily even though it can match high priority rule

Tags:

antlr

antlr4

I am using the following ANTLR grammar to define a function.

definition_function
    : DEFINE FUNCTION function_name '[' language_name ']'
      RETURN attribute_type '{' function_body '}'
    ;

function_name
    : id
    ;

language_name
    : id
    ;

function_body
    : SCRIPT
    ;

SCRIPT
    :   '{' ('\u0020'..'\u007e' | ~( '{' | '}' ) )* '}' 
        { setText(getText().substring(1, getText().length()-1)); }
    ;

But when I try to parse two functions like below,

define function concat[Scala] return string {
  var concatenatedString = ""
  for(i <- 0 until data.length) {
     concatenatedString += data(i).toString
  }
  concatenatedString
};
define function concat[JavaScript] return string {
  var str1 = data[0];
  var str2 = data[1];
  var str3 = data[2];
  var res = str1.concat(str2,str3);
  return res;
};

Then ANTLR doesn't parse this like two function definitions, but like a single function with the following body,

  var concatenatedString = ""
  for(i <- 0 until data.length) {
     concatenatedString += data(i).toString
  }
  concatenatedString
};
define function concat[JavaScript] return string {
  var str1 = data[0];
  var str2 = data[1];
  var str3 = data[2];
  var res = str1.concat(str2,str3);
  return res;

Can you explain this behavior? The body of the function can have anything in it. How can I correctly define this grammar?

like image 834
Ayash Avatar asked Feb 13 '15 09:02

Ayash


1 Answers

Your rule matches that much because '\u0020'..'\u007e' from the rule '{' ('\u0020'..'\u007e' | ~( '{' | '}' ) )* '}' matches both { and }.

Your rule should work if you define it like this:

SCRIPT
    :   '{' ( SCRIPT | ~( '{' | '}' ) )* '}' 
    ;

However, this will fail when the script block contains, says, strings or comments that contain { or }. Here is a way to match a SCRIPT token, including comments and string literals that could contain { and '}':

SCRIPT
 : '{' SCRIPT_ATOM* '}'
 ;

fragment SCRIPT_ATOM
 : ~[{}]
 | '"' ~["]* '"'
 | '//' ~[\r\n]*
 | SCRIPT
 ;

A complete grammar that properly parses your input would then look like this:

grammar T;

parse
 : definition_function* EOF
 ;

definition_function
 : DEFINE FUNCTION function_name '[' language_name ']' RETURN attribute_type SCRIPT ';'
 ;

function_name
 : ID
 ;

language_name
 : ID
 ;

attribute_type
 : ID
 ;

DEFINE
 : 'define'
 ;

FUNCTION
 : 'function'
 ;

RETURN
 : 'return'
 ;

ID
 : [a-zA-Z_] [a-zA-Z_0-9]*
 ;

SCRIPT
 : '{' SCRIPT_ATOM* '}'
 ;

SPACES
 : [ \t\r\n]+ -> skip
 ;

fragment SCRIPT_ATOM
 : ~[{}]
 | '"' ~["]* '"'
 | '//' ~[\r\n]*
 | SCRIPT
 ;

which also parses the following input properly:

define function concat[JavaScript] return string {
  for (;;) { 
    while (true) { } 
  }
  var s = "}"
  // }
  return s 
};
like image 142
Bart Kiers Avatar answered Nov 10 '22 05:11

Bart Kiers