How to achieve Perl regular expression ^ and $ in the ANLTR4 lexer? ie. to match the start of a line and end of a line without consuming any character.
I am trying to use ANTLR4 lexer to match a # character at the start of a line but not in the middle of a line For example, to isolate and toss out all C++ preprocessor directives regardless of which directive it is while disregard a # inside a string literal. (Normally we can tokenize C++ string literals to eliminate a # appearing in the middle of a line but assuming we're not doing that). That means I only want to specify # .*? without bothering #if #ifndef #pragma, etc.
Also, the C++ standard allows whitespace and multi line comments right before and after the # e.g.
/* helo
world*/ # /* hel
l
o
*/ /*world */ifdef .....
is considered a valid preprocessor directive appearing on a single line. (the CRLFs inside the ML COMMENTs are tossed)
This's what I am doing currently:
PPLINE: '\r'? '\n' (ML_COMMENT | '\t' | '\f' |' ')* '#' (ML_COMMENT | ~[\r\n])+ -> channel(PPDIR);
But the problem is I have to rely on the existence of a CRLF before the # and toss out that CRLF altogether with the directive. I need to replace the CRLF tossed out by the CRLF of this directive line so I've to make sure the directive is terminated by a CRLF.
However, that means my grammar cannot handle a directive appearing right at the start of file (i.e. no preceding CRLF) or preceded by an EOF without terminating CRLF.
If the Perl style regex ^ $ syntax is available, I can match the SOL/EOL instead of explicitly matching and consuming CRLF.
You can use semantic predicates for the conditions.
PPLINE
: {getCharPositionInLine() == 0}?
(ML_COMMENT | '\t' | '\f' |' ')* '#' (ML_COMMENT | ~[\r\n])+
{_input.LA(1) == '\r' || _input.LA(1) == '\n'}?
-> channel(PPDIR)
;
You could try having multiple rules with gated semantics (Different lexer rules in different state) or with modes (pushMode -> http://www.antlr.org/wiki/display/ANTLR4/Lexer+Rules), having an alternative rule for the beginning of the file and then switching to the core rules when the directives end, but it could be a long job.
Firstly, perhaps, I would try if really there are problems in parsing #pragma/preprocessor directives without changing anything, because for example if the problem of finding a # is it could be present in strings and comments, then just by ordering the rules you should be able to direct it to the right case (but this could be a problem for languages where you can put directives in comments).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With