I have the following grammar for ANTLR 4:
grammar Pattern;
//parser rules
parse : string LBRACK CHAR DASH CHAR RBRACK ;
string : (CHAR | DASH)+ ;
//lexer rules
DASH : '-' ;
LBRACK : '[' ;
RBRACK : ']' ;
CHAR : [A-Za-z0-9] ;
And I'm trying to parse the following string
ab-cd[0-9]
The code parses out the ab-cd
on the left which will be treated as a literal string in my application. It then parses out [0-9]
as a character set which in this case will translate to any digit. My grammar works for me except I don't like to have (CHAR | DASH)+
as a parser rule when it's simply being treated as a token. I would rather the lexer create a STRING
token and give me the following tokens:
"ab-cd" "[" "0" "-" "9" "]"
instead of these
"ab" "-" "cd" "[" "0" "-" "9" "]"
I have looked at other examples, but haven't been able to figure it out. Usually other examples have quotes around such string literals or they have whitespace to help delimit the input. I'd like to avoid both. Can this be accomplished with lexer rules or do I need to continue to handle it in the parser rules like I'm doing?
In ANTLR 4, you can use lexer modes for this.
STRING : [a-z-]+;
LBRACK : '[' -> pushMode(CharSet);
mode CharSet;
DASH : '-';
NUMBER : [0-9]+;
RBRACK : ']' -> popMode;
After parsing a [
character, the lexer will operate in mode CharSet
until a ]
character is reached and the popMode
command is executed.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With