I've defined multiple lexer rules that potentially matches the same character sequence. For example:
LBRACE: '{' ;
RBRACE: '}' ;
LPARENT: '(' ;
RPARENT: ')' ;
LBRACKET: '[' ;
RBRACKET: ']' ;
SEMICOLON: ';' ;
ASTERISK: '*' ;
AMPERSAND: '&' ;
IGNORED_SYMBOLS: ('!' | '#' | '%' | '^' | '-' | '+' | '=' |
'\\'| '|' | ':' | '"' | '\''| '<' | '>' | ',' | '.' |'?' | '/' ) ;
// WS comments*****************************
WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};
ML_COMMENT: '/*' .* '*/' {$channel=HIDDEN;};
SL_COMMENT: '//' .* '\r'? '\n' {$channel=HIDDEN;};
STRING_LITERAL: '"' (STR_ESC | ~( '"' ))* '"';
fragment STR_ESC: '\\' '"' ;
CHAR_LITERAL : '\'' (CH_ESC | ~( '\'' )) '\'' ;
fragment CH_ESC : '\\' '\'';
My IGNORED_SYMBOLS and ASTERISK match /, " and * respectively. Since they're placed (unintentionally) before my comment and string literal rules which also match /* and ", I expect the comment and string literal rules would be disabled (unintentionally) . But surprisely, the ML_COMMENT, SL_COMMENT and STRING_LITERAL rules still work correctly.
This is somewhat confusing. Isn't that a /, whether it is part of /* or just a standalone /, will always be matched and consumed by the IGNORED_SYMBOLS first before it has any chance to be matched by the ML_COMMENT?
What is the way the lexer decides which rules to apply if the characters match more than one rule?
A lexer (often called a scanner) breaks up an input stream of characters into vocabulary symbols for a parser, which applies a grammatical structure to that symbol stream.
ANTLR or ANother Tool for Language Recognition is a lexer and parser generator aimed at building and walking parse trees. It makes it effortless to parse nontrivial text inputs such as a programming language syntax.
You should include an explicit EOF at the end of your entry rule any time you are trying to parse an entire input file. If you do not include the EOF , it means you are not trying to parse the entire input, and it's acceptable to parse only a portion of the input if it means avoiding a syntax error.
Add the package name that you want to see in the Java file in which the lexer and parser files will be created. Add the Language in which you want the output like Java , Python etc. Tick the generate parser tree listener and generate tree visitor if you want to modify the visitor. Now the configuration is done.
What is the way the lexer decides which rules to apply if the characters match more than one rule?
Lexer rules are matched from top to bottom. In case two (or more) rules match the same number of characters, the one that is defined first has precedence over the one(s) later defined in the grammar. In case a rule matches N
number of characters and a later rule matches the same N
characters plus 1 or more characters, then the later rule is matched (greedy match).
Take the following rules for example:
DO : 'do';
ID : 'a'..'z'+;
The input "do"
would obviously be matched by the rule DO
.
And input like: "done"
would be greedily matched by ID
. It is not tokenized as the 2 tokens: [DO:"do"]
followed by [ID:"ne"]
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With