I am having trouble handling whitespace. In the following grammar excerpt, I set up the lexer so that whitespace is skipped:
ENTITY_VAR
    : 'user'
    | 'resource'
    ;
INT : DIGIT+ | '-' DIGIT+ ;
ID : LETTER (LETTER | DIGIT | SPECIAL)* ;
ENTITY_ID : '__' ENTITY_VAR ('_w_' ID)?;
NEWLINE : '\r'? '\n';
WS : [ \t\r\n]+ -> skip; // skip spaces, tabs, newlines
fragment LETTER : [a-zA-Z];
fragment DIGIT : [0-9];
fragment SPECIAL : ('_' | '#' );
The problem is that I would like to match variable names of the form ENTITY_ID such that the matched string contains no whitespace. Writing it as a lexer rule, as I did here, would be sufficient, but I'd like to do it with a parser rule instead, because I want direct access to the two tokens ENTITY_VAR and ID individually from my code, rather than having them squeezed together into a single ENTITY_ID token.
Any ideas, please?
Basically, any solution that lets me access ENTITY_VAR and ID directly would suit me, whether by leaving ENTITY_ID as a lexer rule or by moving it to the parser.
There are several approaches I can think of (in no particular order):
- [...] ENTITY_ID. See "ANTLR4: How to inject tokens" for inspiration.
- [...] ENTITY_ID tokens and split them into several other tokens, then pass this stream to the parser.
- [...] ENTITY_ID part (=> is error) or not (=> ignore error).
- Create a "trap" rule like this:
INVALID_ENTITY_ID : '__' WS+ ENTITY_VAR WS? ('_w_' WS? ID)?
                  | '__' WS? ENTITY_VAR WS+ ('_w_' WS? ID)?
                  | '__' WS? ENTITY_VAR WS? ('_w_' WS+ ID)
                  ;
This will catch invalid ENTITY_IDs, because the lexer prefers the longest match and this rule matches more input than any of the individual tokens the text would otherwise be split into.
I'd go with 2, if it doesn't alter the parse in the "non error" case, i.e. no code is interpreted differently by allowing whitespace.
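If ENTITY_ID stays a lexer rule, the two parts can still be recovered in application code by splitting the token text. The following is only a minimal sketch of that idea, assuming the Python target; the names `split_entity_id`, `EntityId` and `ENTITY_ID_RE` are made up for illustration, and the regular expression simply mirrors the ENTITY_VAR and ID rules from the question.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical helper, not part of ANTLR. The pattern mirrors the lexer rules:
#   ENTITY_ID : '__' ENTITY_VAR ('_w_' ID)? ;
#   ID        : LETTER (LETTER | DIGIT | SPECIAL)* ;
ENTITY_ID_RE = re.compile(
    r"__(?P<entity_var>user|resource)"
    r"(?:_w_(?P<id_part>[A-Za-z][A-Za-z0-9_#]*))?"
)


class EntityId(NamedTuple):
    entity_var: str          # text matched by ENTITY_VAR
    id_part: Optional[str]   # text matched by ID, or None if absent


def split_entity_id(text: str) -> EntityId:
    """Split the text of an ENTITY_ID token into its two parts."""
    match = ENTITY_ID_RE.fullmatch(text)
    if match is None:
        # Also rejects any text containing whitespace.
        raise ValueError(f"not a valid ENTITY_ID: {text!r}")
    return EntityId(match.group("entity_var"), match.group("id_part"))
```

In a listener or visitor this would presumably be called on the token text, for example `split_entity_id(ctx.ENTITY_ID().getText())`, where `ctx` is whatever rule context contains the token in your grammar.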