While working on an ANTLR 3.5 grammar for parsing Java, I noticed that the IDENTIFIER rule consumes some keywords in my lexer grammar. The lexer grammar is:
lexer grammar JavaLexer;
options {
//k=8;
language=Java;
filter=true;
//backtrack=true;
}
@lexer::header {
package java;
}
@lexer::members {
public ArrayList<String> keywordsList = new ArrayList<String>();
}
V_DECLARATION
:
( ((MODIFIERS)=>tok1=MODIFIERS WS+)? tok2=TYPE WS+ var=V_DECLARATOR WS* )
{...};
fragment
V_DECLARATOR
:
(
tok=IDENTIFIER WS* ( ',' | ';' | ASSIGN WS* V_VALUE )
)
{...};
fragment
V_VALUE
: (IDENTIFIER (DOT WS* IDENTIFIER WS* '(' | ',' | ';'))
;
MODIFIERS
:
(PUBLIC | PRIVATE | FINAL)+
;
PRIVATE
: tok = 'private'
{ keywordsList.add($tok.getText()); }
;
PUBLIC
: tok = 'public'
{ keywordsList.add($tok.getText()); }
;
DOT
: '.'
{ keywordsList.add("."); }
;
THIS
: tok = 'this'
{ keywordsList.add($tok.getText()); }
;
ASSIGN
: '='
{ keywordsList.add("="); }
;
IDENTIFIER:
tok =Identifier
{
//System.out.println("Identifier: " + $tok.text);
}
;
fragment
Identifier
: (Letter (Letter|JavaIDDigit)*);
fragment
Letter
: '\u0024' |
'\u0041'..'\u005a' |
'\u005f' |
'\u0061'..'\u007a' |
'\u00c0'..'\u00d6' |
'\u00d8'..'\u00f6' |
'\u00f8'..'\u00ff' |
'\u0100'..'\u1fff' |
'\u3040'..'\u318f' |
'\u3300'..'\u337f' |
'\u3400'..'\u3d2d' |
'\u4e00'..'\u9fff' |
'\uf900'..'\ufaff'
;
fragment
JavaIDDigit
: '\u0030'..'\u0039' |
'\u0660'..'\u0669' |
'\u06f0'..'\u06f9' |
'\u0966'..'\u096f' |
'\u09e6'..'\u09ef' |
'\u0a66'..'\u0a6f' |
'\u0ae6'..'\u0aef' |
'\u0b66'..'\u0b6f' |
'\u0be7'..'\u0bef' |
'\u0c66'..'\u0c6f' |
'\u0ce6'..'\u0cef' |
'\u0d66'..'\u0d6f' |
'\u0e50'..'\u0e59' |
'\u0ed0'..'\u0ed9' |
'\u1040'..'\u1049'
;
WS : (' '|'\r'|'\t'|'\u000C'|'\n') {$channel=HIDDEN; skip();}
;
When I try to parse the line:
public final int inch = this.getValue();
the rule 'V_VALUE -> IDENTIFIER' also consumes the "this" keyword, which is undesirable, since keywords should also be collected into a separate list.
Is there any trick or provision in ANTLR grammars to match keywords with their own rules without affecting other rules such as IDENTIFIER?
Your problem is indeed caused by a misunderstanding of what belongs in the lexer and what belongs in the parser. The lexer works at the level of individual words: "this" is a THIS, "0" is a NUMBER and "that" is an IDENTIFIER.
Since the lexer's job is to determine which words are on the input, it processes the input and looks for the longest valid match (in ANTLR, if two or more rules accept the same input, the topmost one in the source grammar wins). It does not look for the "most specific" match, simply the longest one.
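The rule described above (longest match, with the topmost rule winning ties) can be modeled in a few lines of Java. This is only a simplified sketch of the rule as stated here, not ANTLR's actual implementation: the class name LongestMatchDemo, the regex patterns and the "RULE:text" output format are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LongestMatchDemo {
    // Rules in "grammar order": earlier entries win ties.
    // Names and patterns are illustrative, not taken from the real grammar.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("THIS", Pattern.compile("this"));
        RULES.put("IDENTIFIER", Pattern.compile("[A-Za-z_$][A-Za-z0-9_$]*"));
        RULES.put("DOT", Pattern.compile("\\."));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // lookingAt() anchors the match at the start of the region.
                // Strict '>' keeps the earlier rule when lengths are equal.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestRule = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestRule == null) {
                pos++; // no rule matches here: skip one character
                continue;
            }
            tokens.add(bestRule + ":" + input.substring(pos, bestEnd));
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("that."));     // [IDENTIFIER:that, DOT:.]
        System.out.println(tokenize("this."));     // [THIS:this, DOT:.]
        System.out.println(tokenize("thisValue")); // [IDENTIFIER:thisValue]
    }
}
```

Note the last case: IDENTIFIER wins for "thisValue" because it matches more characters, even though THIS is listed first.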
Example, for the input "that.":
"t": THIS or IDENTIFIER
"h": THIS or IDENTIFIER
"a": can't be THIS, only IDENTIFIER is possible
"t": IDENTIFIER for sure.
".": can't be part of IDENTIFIER, so "that" will be matched as IDENTIFIER and the last input "." will be matched as the start of the next token
And another example, for the input "this.":
"t", "h", "i", "s": THIS or IDENTIFIER the whole time.
".": can't be part of either rule, so "this" will be matched as THIS (topmost matching rule) rather than IDENTIFIER, and "." will start a new token
And now to the important part: as long as a lexer rule is referenced from another lexer rule, it is considered to be merely a fragment of the referencing lexer rule. This means that matching it won't emit a new token, and also that it won't trigger any decision between multiple matching tokens at the end of the fragment's match. Since "this" can indeed be matched by the IDENTIFIER rule, the whole declaration conforms to the V_DECLARATION lexer rule. So unless there's another lexer rule that can match at least the same length of input and appears earlier in the grammar, V_DECLARATION will apply.
You didn't provide any rule referencing THIS, so we don't know exactly how this plays out in your grammar, but the obvious cause is that the lexer can match longer input, or match with an earlier rule, than anything that uses the THIS rule.
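To make the effect concrete, here is a rough Java model of what happens when a composite rule inlines the identifier pattern. It is a sketch under stated assumptions, not your grammar: the class CompositeRuleDemo, the regex stand-ins for each rule and the simplified shape of V_DECLARATION are all hypothetical, and the matching loop follows the longest-match model described above rather than ANTLR's real algorithm.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CompositeRuleDemo {
    // Illustrative stand-in for the Identifier fragment.
    static final String ID = "[A-Za-z_$][A-Za-z0-9_$]*";

    // Word-level rules only, in grammar order (earlier wins ties).
    static Map<String, Pattern> wordRules() {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("PUBLIC", Pattern.compile("public"));
        rules.put("FINAL", Pattern.compile("final"));
        rules.put("THIS", Pattern.compile("this"));
        rules.put("IDENTIFIER", Pattern.compile(ID));
        rules.put("DOT", Pattern.compile("\\."));
        rules.put("ASSIGN", Pattern.compile("="));
        return rules;
    }

    // Same rules, preceded by a composite rule that inlines ID,
    // loosely modeled on V_DECLARATION (hypothetical simplification).
    static Map<String, Pattern> withCompositeRule() {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("V_DECLARATION", Pattern.compile(
            "(?:(?:public|private|final)\\s+)*" + ID + "\\s+" + ID + "\\s*=\\s*" + ID));
        rules.putAll(wordRules());
        return rules;
    }

    static List<String> tokenize(Map<String, Pattern> rules, String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Longest match wins; strict '>' keeps the earlier rule on ties.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestRule = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestRule == null) {
                pos++; // skip characters that no rule matches
                continue;
            }
            tokens.add(bestRule + ":" + input.substring(pos, bestEnd));
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "public final int inch = this.getValue();";
        // Word-level rules: "this" is emitted as its own THIS token.
        System.out.println(tokenize(wordRules(), line));
        // With the composite rule: "this" is swallowed as part of V_DECLARATION.
        System.out.println(tokenize(withCompositeRule(), line));
    }
}
```

In this model the second call never produces a THIS token, because the composite rule matches more input than any keyword rule starting at the same position. Keeping the lexer at the word level and leaving the structure of a declaration to the parser avoids that.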