 

'IDENTIFIER' rule also consumes keyword in ANTLR Lexer grammar

While working on an ANTLR 3.5 grammar for parsing Java, I noticed that the 'IDENTIFIER' rule also consumes some keywords in my lexer grammar. The lexer grammar is:

lexer grammar JavaLexer;

options {
   //k=8;
   language=Java;
   filter=true;
   //backtrack=true;
}

@lexer::header {
package java;
}

@lexer::members {
public ArrayList<String> keywordsList = new ArrayList<String>();
}

V_DECLARATION
:
( ((MODIFIERS)=>tok1=MODIFIERS WS+)? tok2=TYPE WS+ var=V_DECLARATOR WS* )
{...};

fragment
V_DECLARATOR
  :
  (
    tok=IDENTIFIER WS* ( ',' | ';' | ASSIGN WS* V_VALUE )
  )
  {...};

fragment
V_VALUE
: (IDENTIFIER (DOT WS* IDENTIFIER WS* '(' | ',' | ';'))
;

MODIFIERS
  :
  (PUBLIC | PRIVATE | FINAL)+
;

PRIVATE
    :    tok = 'private'
    { keywordsList.add($tok.getText());  }
    ;

PUBLIC
    :    tok = 'public'
    { keywordsList.add($tok.getText()); }
    ;

DOT
    :    '.'
    { keywordsList.add("."); }
    ;

THIS
    :    tok = 'this'
    { keywordsList.add($tok.getText()); }
    ;

ASSIGN
    :    '='
        { keywordsList.add("="); }
    ;    

IDENTIFIER:
  tok =Identifier
  {  
   //System.out.println("Identifier: " + $tok.text);
  }
  ;  

fragment
Identifier 
    :   (Letter (Letter|JavaIDDigit)*);

fragment
Letter
    :  '\u0024' |
       '\u0041'..'\u005a' |
       '\u005f' |
       '\u0061'..'\u007a' |
       '\u00c0'..'\u00d6' |
       '\u00d8'..'\u00f6' |
       '\u00f8'..'\u00ff' |
       '\u0100'..'\u1fff' |
       '\u3040'..'\u318f' |
       '\u3300'..'\u337f' |
       '\u3400'..'\u3d2d' |
       '\u4e00'..'\u9fff' |
       '\uf900'..'\ufaff'
    ;

fragment
JavaIDDigit
    :  '\u0030'..'\u0039' |
       '\u0660'..'\u0669' |
       '\u06f0'..'\u06f9' |
       '\u0966'..'\u096f' |
       '\u09e6'..'\u09ef' |
       '\u0a66'..'\u0a6f' |
       '\u0ae6'..'\u0aef' |
       '\u0b66'..'\u0b6f' |
       '\u0be7'..'\u0bef' |
       '\u0c66'..'\u0c6f' |
       '\u0ce6'..'\u0cef' |
       '\u0d66'..'\u0d6f' |
       '\u0e50'..'\u0e59' |
       '\u0ed0'..'\u0ed9' |
       '\u1040'..'\u1049'
   ;

WS  :  (' '|'\r'|'\t'|'\u000C'|'\n') {$channel=HIDDEN; skip();}
    ;

When I try to parse the line:

public final int inch = this.getValue();

Then the rule 'V_VALUE -> IDENTIFIER' also consumes the "this" keyword, which is undesirable, since keywords should also be collected into a separate list.

Is there any trick or provision in an ANTLR grammar to match keywords by their own rules without affecting other functionality such as "IDENTIFIER"?

Kishore_2021 asked Mar 09 '23 21:03


1 Answer

Your problem is caused by a misunderstanding of what belongs in the lexer and what belongs in the parser:

  • The lexer's job is to determine which words the stream of characters represents
    • e.g. that "this" is a THIS, "0" is a NUMBER and "that" is an IDENTIFIER
  • The parser's job is to determine whether the sequence of words emitted by the lexer conforms to the given language, that is, whether the "sentence" made of those words makes sense
    • e.g. that a declaration consists of optional modifiers, a type, and a list of identifiers
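
The split described above can be sketched in plain Java. This is only an illustration, not ANTLR-generated code: the class name LexThenParseDemo, the small TYPES set, and the assumption that words are separated by whitespace are all made up for the example. The lexer step labels each word in isolation; the parser step then checks whether the labels form a declaration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LexThenParseDemo {
    // Hypothetical word lists, chosen only to cover the question's sample line.
    static final Set<String> MODIFIERS = new HashSet<>(Arrays.asList("public", "private", "final"));
    static final Set<String> TYPES = new HashSet<>(Arrays.asList("int", "long", "boolean"));

    // Lexer step: classify each word on its own, with no knowledge of context.
    static List<String> lex(String input) {
        List<String> kinds = new ArrayList<>();
        for (String word : input.trim().split("\\s+")) {
            if (MODIFIERS.contains(word)) kinds.add("MODIFIER");
            else if (TYPES.contains(word)) kinds.add("TYPE");
            else if (word.equals("this")) kinds.add("THIS");
            else if (word.matches("[A-Za-z_$][A-Za-z_$0-9]*")) kinds.add("IDENTIFIER");
            else kinds.add("OTHER");
        }
        return kinds;
    }

    // Parser step: does the token sequence start with MODIFIER* TYPE IDENTIFIER?
    static boolean startsWithDeclaration(List<String> kinds) {
        int i = 0;
        while (i < kinds.size() && kinds.get(i).equals("MODIFIER")) i++;
        return i + 1 < kinds.size()
                && kinds.get(i).equals("TYPE")
                && kinds.get(i + 1).equals("IDENTIFIER");
    }
}
```

Because "this" gets its own label before any structure is examined, collecting keywords becomes a simple pass over the token list, independent of where the keyword appears.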

Since the lexer's job is to determine which words are in the input, it processes the input and looks for the longest valid match. It does not look for the "most specific" match, only the longest one. In ANTLR, if two or more rules accept the same input, the topmost one in the grammar source wins.

Example:

  • Input t
    • Can be THIS or IDENTIFIER
  • Input h
    • Still can be THIS or IDENTIFIER
  • Input a
    • Can no longer be THIS, only IDENTIFIER is possible
  • Input t
    • IDENTIFIER for sure
  • Input .
    • No longer matches IDENTIFIER, so "that" is emitted as an IDENTIFIER token and the "." becomes the start of the next token

And another example:

  • Input t, h, i, s
    • Can be matched as either THIS or IDENTIFIER the whole time
  • Input .
    • Extends neither rule, so "this" is emitted as THIS (the topmost matching rule) rather than IDENTIFIER, and "." starts a new token
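
The two walkthroughs above can be reproduced with a small simulation. This is a sketch of the "longest match, earliest rule wins ties" behaviour using java.util.regex, not the code ANTLR generates; the class name LongestMatchDemo and the ASCII-only identifier pattern are simplifying assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongestMatchDemo {
    // Rules in grammar order (LinkedHashMap keeps insertion order).
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("THIS", Pattern.compile("this"));
        RULES.put("DOT", Pattern.compile("\\."));
        // Simplified, ASCII-only stand-in for the question's Identifier fragment.
        RULES.put("IDENTIFIER", Pattern.compile("[A-Za-z_$][A-Za-z_$0-9]*"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                // Strictly longer matches replace the current best,
                // so on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestEnd = m.end();
                    bestRule = rule.getKey();
                }
            }
            if (bestRule == null) { pos++; continue; } // no rule applies: skip one character
            tokens.add(bestRule + ":" + input.substring(pos, bestEnd));
            pos = bestEnd;
        }
        return tokens;
    }
}
```

With these rules, tokenize("that.") yields [IDENTIFIER:that, DOT:.] and tokenize("this.") yields [THIS:this, DOT:.], while tokenize("thisValue") yields a single IDENTIFIER:thisValue because the identifier match is longer than the keyword match.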

Now to the important part: when a lexer rule is referenced from another lexer rule, it is treated as merely a fragment of the referencing rule. Matching it does not emit a new token, and it does not trigger any decision between competing token rules at the end of the fragment's match. Since "this" can be matched by the IDENTIFIER rule, the whole declaration conforms to the V_DECLARATION lexer rule. So unless another lexer rule can match at least the same length of input and appears earlier in the grammar, V_DECLARATION will apply.
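
A rough simulation of this effect, again with java.util.regex rather than real ANTLR output: the hypothetical COMPOSITE rule below stands in for a lexer rule that references IDENTIFIER internally (loosely modelled on the first alternative of V_VALUE), and the keywords list plays the role of keywordsList, filled only when the THIS rule itself wins. The class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FragmentSwallowDemo {
    // Simplified, ASCII-only identifier pattern (an assumption for brevity).
    static final String ID = "[A-Za-z_$][A-Za-z_$0-9]*";

    static Map<String, Pattern> rules(boolean withComposite) {
        Map<String, Pattern> r = new LinkedHashMap<>();
        if (withComposite) {
            // Hypothetical rule: IDENTIFIER DOT WS* IDENTIFIER WS* '('
            r.put("COMPOSITE", Pattern.compile(ID + "\\.\\s*" + ID + "\\s*\\("));
        }
        r.put("THIS", Pattern.compile("this"));
        r.put("DOT", Pattern.compile("\\."));
        r.put("IDENTIFIER", Pattern.compile(ID));
        return r;
    }

    // Returns the keywords recorded by the THIS rule's "action".
    static List<String> keywordsSeen(String input, boolean withComposite) {
        Map<String, Pattern> rules = rules(withComposite);
        List<String> keywords = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                if (m.lookingAt() && m.end() > bestEnd) { // longest wins, earliest on ties
                    bestEnd = m.end();
                    bestRule = rule.getKey();
                }
            }
            if (bestRule == null) { pos++; continue; } // unmatched character: skip it
            if (bestRule.equals("THIS")) {
                keywords.add(input.substring(pos, bestEnd)); // fires only when THIS is the winning token
            }
            pos = bestEnd;
        }
        return keywords;
    }
}
```

For the input "this.getValue();", keywordsSeen(input, false) returns [this], but keywordsSeen(input, true) returns an empty list: the composite rule matches the longer text "this.getValue(" and the keyword is consumed as part of it without THIS ever being chosen.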

You didn't provide any rule referencing THIS, so we don't know exactly how this plays out in your grammar, but the likely cause is that the lexer can match a longer input, or match with an earlier rule, than anything that uses the THIS rule.

Jiri Tousek answered Apr 27 '23 03:04