
ANTLR4 Lexer Matching Start of Line End Of Line

Tags: regex, antlr4

How can I get the effect of the Perl regular expression anchors ^ and $ in the ANTLR4 lexer, i.e. match the start of a line and the end of a line without consuming any character?

I am trying to use the ANTLR4 lexer to match a # character at the start of a line but not in the middle of a line, for example to isolate and toss out all C++ preprocessor directives, regardless of which directive it is, while disregarding a # inside a string literal. (Normally we could tokenize C++ string literals to eliminate a # appearing in the middle of a line, but assume we're not doing that.) That means I only want to specify # .*? without bothering with #if, #ifndef, #pragma, etc.

Also, the C++ standard allows whitespace and multi-line comments right before and after the #, e.g.

   /* helo
world*/  #  /* hel
l
o
*/  /*world */ifdef .....

is considered a valid preprocessor directive appearing on a single line (the CRLFs inside the ML_COMMENTs are tossed).

This is what I am currently doing:

PPLINE: '\r'? '\n' (ML_COMMENT | '\t' | '\f' |' ')* '#' (ML_COMMENT | ~[\r\n])+ -> channel(PPDIR); 

The problem is that I have to rely on the existence of a CRLF before the # and toss out that CRLF together with the directive. I then need to replace the tossed CRLF with the CRLF that ends the directive line, so I have to make sure the directive is terminated by a CRLF.

However, that means my grammar cannot handle a directive that appears right at the start of the file (i.e. with no preceding CRLF) or one that is followed directly by EOF without a terminating CRLF.

If the Perl-style regex ^ and $ syntax were available, I could match the SOL/EOL instead of explicitly matching and consuming a CRLF.
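
To make the desired behaviour concrete, here is a minimal sketch of zero-width line anchors using java.util.regex rather than ANTLR. It is only an illustration: the class name AnchorDemo and the simplified pattern (which ignores multi-line comments around the #) are assumptions, not part of the grammar above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: a regex, not an ANTLR lexer rule. Names are hypothetical.
class AnchorDemo {
    // MULTILINE lets ^ and $ match at line boundaries without consuming
    // the line terminator itself.
    static final Pattern DIRECTIVE =
        Pattern.compile("^[ \\t\\f]*#[^\\r\\n]*$", Pattern.MULTILINE);

    // Collects every line that starts with optional blanks followed by '#'.
    static List<String> directives(String src) {
        List<String> out = new ArrayList<>();
        Matcher m = DIRECTIVE.matcher(src);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // First directive has no preceding newline, last one has no trailing newline.
        System.out.println(directives("#include <a>\nint x; // # no\n  #define X 1"));
    }
}
```

Because neither anchor consumes a character, a directive on the very first line and one on a last line without a terminator are both found, which is the behaviour the explicit '\r'? '\n' prefix cannot give.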

JavaMan asked May 05 '13 08:05


2 Answers

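You can use semantic predicates for these conditions (the actions below are written for the Java target). The first predicate plays the role of ^ by requiring the token to start at column 0; the second plays the role of $ by checking, without consuming anything, that the token is followed by the end of the line.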

PPLINE
    :   {getCharPositionInLine() == 0}?
        (ML_COMMENT | '\t' | '\f' |' ')* '#' (ML_COMMENT | ~[\r\n])+
        {_input.LA(1) == '\r' || _input.LA(1) == '\n' || _input.LA(1) == EOF}?
        -> channel(PPDIR)
    ;
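
The logic of the two predicates can be sketched by hand in plain Java, without the ANTLR runtime. This is an illustrative simplification: the class DirectiveScanner and its methods are hypothetical names, and multi-line comments around the # are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ANTLR: mimics the two predicates by hand.
class DirectiveScanner {
    // Returns every directive line in src, without its line terminator.
    static List<String> findDirectives(String src) {
        List<String> found = new ArrayList<>();
        int pos = 0;
        int column = 0; // plays the role of getCharPositionInLine()
        while (pos < src.length()) {
            if (column == 0) { // "start of line" check, consumes nothing
                int end = matchDirective(src, pos);
                if (end >= 0) {
                    found.add(src.substring(pos, end));
                    column += end - pos;
                    pos = end; // the terminator, if any, is left unconsumed
                    continue;
                }
            }
            char c = src.charAt(pos++);
            column = (c == '\n') ? 0 : column + 1;
        }
        return found;
    }

    // Returns the end index of a directive starting at 'start', or -1.
    static int matchDirective(String src, int start) {
        int i = start;
        while (i < src.length() && isBlank(src.charAt(i))) {
            i++;
        }
        if (i >= src.length() || src.charAt(i) != '#') {
            return -1;
        }
        // Stop at '\r', '\n' or end of input: the lookahead role of LA(1).
        while (i < src.length() && src.charAt(i) != '\r' && src.charAt(i) != '\n') {
            i++;
        }
        return i;
    }

    static boolean isBlank(char c) {
        return c == ' ' || c == '\t' || c == '\f';
    }

    public static void main(String[] args) {
        System.out.println(findDirectives("#include <a>\nint x; // # no\n  #define X 1"));
    }
}
```

The column counter is reset only on '\n', and the line terminator stays in the input after a directive is matched, so it can still be emitted as its own token.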

Sam Harwell answered Sep 24 '22 21:09


You could try having multiple rules gated by semantic predicates (different lexer rules in different states) or using modes (pushMode, see http://www.antlr.org/wiki/display/ANTLR4/Lexer+Rules): have an alternative rule for the beginning of the file and then switch to the core rules when the directives end. It could be a long job, though.

First, though, I would check whether there really is a problem parsing #pragma and other preprocessor directives without changing anything. For example, if the concern is that a # could appear inside strings and comments, then just ordering the rules should be enough to direct it to the right case (although this could be a problem for languages that allow directives inside comments).
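
As a rough illustration of that idea, here is a hand-written Java scanner that tries its alternatives in a fixed order, so a # inside a comment or string literal never reaches the directive case. It is a sketch under assumptions: OrderedScanner and its methods are hypothetical names, and ANTLR itself prefers the longest match and uses rule order only to break ties. Unlike the predicate approach, it does not reject a # in the middle of a line outside strings and comments.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not ANTLR: alternatives are tried in a fixed order.
class OrderedScanner {
    static List<String> directives(String src) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            if (src.startsWith("/*", i)) {            // 1. block comment
                int close = src.indexOf("*/", i + 2);
                i = (close < 0) ? src.length() : close + 2;
            } else if (src.startsWith("//", i)) {     // 2. line comment
                i = endOfLine(src, i);
            } else if (src.charAt(i) == '"') {        // 3. string literal
                i = endOfString(src, i);
            } else if (src.charAt(i) == '#') {        // 4. directive
                int end = endOfLine(src, i);
                out.add(src.substring(i, end));
                i = end;
            } else {                                  // 5. anything else
                i++;
            }
        }
        return out;
    }

    // Index of the next '\r' or '\n' at or after 'from', or end of input.
    static int endOfLine(String src, int from) {
        int i = from;
        while (i < src.length() && src.charAt(i) != '\r' && src.charAt(i) != '\n') {
            i++;
        }
        return i;
    }

    // Index just past the closing quote of the literal opened at 'open'.
    static int endOfString(String src, int open) {
        int i = open + 1;
        while (i < src.length() && src.charAt(i) != '"') {
            if (src.charAt(i) == '\\') {
                i++; // skip the escaped character
            }
            i++;
        }
        return Math.min(i + 1, src.length());
    }

    public static void main(String[] args) {
        System.out.println(directives("const char* s = \"# no\"; /* # no */\n#pragma once\n"));
    }
}
```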

lunadir answered Sep 24 '22 21:09