
ANTLR 4 lexer tokens inside other tokens

Tags: antlr4

I have the following grammar for ANTLR 4:

grammar Pattern;

//parser rules
parse   : string LBRACK CHAR DASH CHAR RBRACK ;
string  : (CHAR | DASH)+ ;

//lexer rules
DASH    : '-' ;
LBRACK  : '[' ;
RBRACK  : ']' ;
CHAR    : [A-Za-z0-9] ;

I'm trying to parse the following string:

ab-cd[0-9]

The grammar parses out the ab-cd on the left, which my application treats as a literal string. It then parses [0-9] as a character set, which in this case means any digit. The grammar works, but I don't like having (CHAR | DASH)+ as a parser rule when the result is simply treated as a token. I would rather have the lexer create a STRING token and give me the following tokens:

"ab-cd" "[" "0" "-" "9" "]"

instead of these (CHAR matches a single character, so each letter is its own token):

"a" "b" "-" "c" "d" "[" "0" "-" "9" "]"

I have looked at other examples but haven't been able to figure it out. They usually have quotes around such string literals, or whitespace to help delimit the input, and I'd like to avoid both. Can this be accomplished with lexer rules, or do I need to keep handling it in the parser rules as I do now?

Asked by Charles on May 10 '13 at 15:05


1 Answer

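In ANTLR 4, you can use lexer modes for this. Modes are only allowed in a lexer grammar, so the combined grammar Pattern; has to be split into a separate lexer grammar and parser grammar first.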

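// Default mode: a run of letters and dashes becomes a single STRING token.
// (Widen the set to [A-Za-z0-9-] to cover everything CHAR matched in the question.)
STRING : [a-z-]+;
LBRACK : '[' -> pushMode(CharSet);

mode CharSet;

// Inside the brackets, '-' is a token of its own again.
DASH : '-';
NUMBER : [0-9]+;
RBRACK : ']' -> popMode;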

After matching a [ character, the lexer switches to mode CharSet and stays there until it matches a ] character, at which point the popMode command returns it to the default mode.
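
To make the mode switching concrete, here is a minimal Python sketch that imitates the mode stack by hand. It is not ANTLR's implementation and not generated code: the RULES table and the tokenize function are hypothetical names made up for this illustration, and the sketch simply takes the first rule that matches in the current mode (ANTLR prefers the longest match, which makes no difference for these particular rules).

```python
import re

# Rules per mode: (token name, pattern, mode action).
# The patterns mirror the lexer rules in the answer; "push:<mode>" and "pop"
# stand in for ANTLR's pushMode(...) and popMode commands.
RULES = {
    "DEFAULT": [
        ("STRING", re.compile(r"[a-z-]+"), None),
        ("LBRACK", re.compile(r"\["), "push:CharSet"),
    ],
    "CharSet": [
        ("DASH", re.compile(r"-"), None),
        ("NUMBER", re.compile(r"[0-9]+"), None),
        ("RBRACK", re.compile(r"\]"), "pop"),
    ],
}


def tokenize(text):
    """Split text into (name, lexeme) pairs, tracking a stack of modes."""
    modes = ["DEFAULT"]  # the top of the stack is the active mode
    pos = 0
    tokens = []
    while pos < len(text):
        for name, pattern, action in RULES[modes[-1]]:
            match = pattern.match(text, pos)
            if match:
                tokens.append((name, match.group()))
                pos = match.end()
                if action == "pop":
                    modes.pop()
                elif action is not None:
                    modes.append(action.split(":", 1)[1])
                break
        else:
            # No rule of the active mode matches at this position.
            raise ValueError(
                f"no rule matches {text[pos]!r} at position {pos} "
                f"in mode {modes[-1]}"
            )
    return tokens


if __name__ == "__main__":
    for token in tokenize("ab-cd[0-9]"):
        print(token)
```

The same '-' character ends up inside a STRING token before the bracket and as a separate DASH token inside it, purely because a different rule set is active at each point.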

Answered by Sam Harwell on Oct 17 '22 at 16:10