Characters Matching Multiple Lexer Rules in ANTLR

I've defined multiple lexer rules that can potentially match the same character sequence. For example:

LBRACE:  '{' ;
RBRACE: '}' ;
LPARENT: '(' ;
RPARENT: ')' ;
LBRACKET: '[' ;
RBRACKET: ']' ;
SEMICOLON: ';' ;
ASTERISK: '*'  ;
AMPERSAND: '&'  ;

IGNORED_SYMBOLS:   ('!' | '#' | '%' | '^' | '-' | '+' | '=' | 
                    '\\'| '|' | ':' | '"' | '\''| '<' | '>' | ',' | '.' |'?' | '/'  ) ;


// WS comments*****************************
WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};
ML_COMMENT: '/*' .* '*/' {$channel=HIDDEN;};
SL_COMMENT: '//' .* '\r'? '\n' {$channel=HIDDEN;};

STRING_LITERAL:  '"' (STR_ESC | ~( '"' ))* '"'; 
fragment STR_ESC:  '\\'  '"'  ; 

CHAR_LITERAL :  '\'' (CH_ESC | ~( '\'' )) '\''  ;  
fragment CH_ESC :  '\\' '\''; 

My IGNORED_SYMBOLS rule matches / and ", and my ASTERISK rule matches *. Since these rules are (unintentionally) placed before my comment and string literal rules, which also begin with /, /* or ", I expected the comment and string literal rules to be effectively disabled. Surprisingly, the ML_COMMENT, SL_COMMENT and STRING_LITERAL rules still work correctly.

This is somewhat confusing. Shouldn't a /, whether it is part of /* or a standalone /, always be matched and consumed by IGNORED_SYMBOLS before ML_COMMENT has any chance to match it?

What is the way the lexer decides which rules to apply if the characters match more than one rule?

JavaMan asked Sep 24 '11 12:09


1 Answer

What is the way the lexer decides which rules to apply if the characters match more than one rule?

Lexer rules are matched from top to bottom. If two or more rules match the same number of characters, the rule defined first takes precedence over those defined later in the grammar. If one rule matches N characters and a later rule matches those same N characters plus one or more additional characters, the later rule wins because it produces the longer match (greedy matching). That is why, in your grammar, /* ... */ is matched by ML_COMMENT even though IGNORED_SYMBOLS is defined earlier and could match the leading / on its own.

Take the following rules for example:

DO : 'do';
ID : 'a'..'z'+;

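The input "do" is matched by the rule DO: both DO and ID match the same two characters, and DO is defined first.

Input such as "done" is greedily matched by ID, because ID matches four characters while DO matches only two. It is not tokenized as the two tokens [DO:"do"] followed by [ID:"ne"].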
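The selection rule described above can be mimicked in a few lines of Python. This is only a rough sketch of "longest match wins, earlier rule wins on a tie" as stated in this answer; it is not how ANTLR's generated lexer is implemented. The `RULES` table and the `tokenize` function are hypothetical names, the regular expressions are stand-ins for the grammar rules, and IGNORED_SYMBOLS is trimmed to a few of its characters.

```python
import re

# Hypothetical rule table in grammar order: earlier entries win ties.
# IGNORED_SYMBOLS is deliberately listed before ML_COMMENT, as in the question.
RULES = [
    ("DO", re.compile(r"do")),
    ("ID", re.compile(r"[a-z]+")),
    ("IGNORED_SYMBOLS", re.compile(r"[!#%^\-+=/]")),  # trimmed subset
    ("ML_COMMENT", re.compile(r"/\*.*?\*/", re.DOTALL)),
    ("WS", re.compile(r"[ \t\r\n\f]+")),
]

def tokenize(text):
    """Pick the longest match at each position; on equal length, the earlier rule."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strict '>' means a later rule must be strictly longer to win.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # whitespace is dropped; comments are kept for visibility
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("do done"))  # [('DO', 'do'), ('ID', 'done')]
print(tokenize("/* x */ / do"))
```

In the second call, the leading / of the comment never becomes an IGNORED_SYMBOLS token, because ML_COMMENT matches more characters from the same starting position; the standalone / afterwards does.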

Bart Kiers answered Sep 28 '22 07:09